Feature Flags — Street-Level Ops¶
Quick Diagnosis Commands¶
# Check LaunchDarkly SDK connection status (look at startup logs)
grep -i "launchdarkly\|feature.*flag\|LD_SDK" /var/log/app/app.log | tail -50
# Check if flag evaluation is happening (add metric/log to hook)
grep "feature_flag.evaluated" /var/log/app/app.log | jq -r '.flag' | sort | uniq -c | sort -rn
# LaunchDarkly REST API: check flag status
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project/new-checkout" | \
jq '{key: .key, on: .environments["production"].on, variations: .variations}'
# Flagsmith REST API: check flag for a specific user
curl -s -H "X-Environment-Key: $FLAGSMITH_ENV_KEY" \
"https://api.flagsmith.com/api/v1/identities/?identifier=user-123" | \
jq '.flags[] | select(.feature.name == "new-checkout") | {enabled: .enabled, value: .feature_state_value}'
# Check OpenFeature provider initialization in logs
kubectl logs deployment/my-service | grep -i "openfeature\|provider\|flag" | tail -20
# Emergency: see all evaluations for a specific flag (if you log them)
kubectl logs deployment/my-service --since=1h | \
jq -r 'select(.flag == "new-checkout") | "\(.user_id) \(.value) \(.reason)"' | \
sort | uniq -c | sort -rn
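The jq one-liners above assume each evaluation is logged as one JSON object with `flag`, `user_id`, `value`, and `reason` fields. A minimal stdlib-only sketch of that logging (field names chosen to match the jq filters; wire it into whatever evaluation hook your SDK offers):

```python
import json
import logging

logger = logging.getLogger("flags")

def log_evaluation(flag: str, user_id: str, value, reason: str) -> str:
    """Emit one JSON line per flag evaluation so the jq pipelines above
    (select(.flag == ...), counting by value/reason) can aggregate them."""
    line = json.dumps({
        "msg": "flag_evaluated",
        "flag": flag,
        "user_id": user_id,
        "value": value,
        "reason": reason,
    })
    logger.info(line)
    return line

# Example: log_evaluation("new-checkout", "user-123", True, "RULE_MATCH")
```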
Gotcha: Stale SDK Cache¶
Rule: The LaunchDarkly SDK caches flag state locally. After you change a flag in the UI, instances on the streaming connection (the default) pick up the change within a few seconds; instances in polling mode wait up to one poll interval (30s by default), so a mixed fleet can take a minute or more to fully converge.
# Check SDK connection mode
client = ldclient.get()
# Streaming (real-time, default): changes propagate in ~1 second
# Polling (fallback): changes propagate every poll_interval (default 30s)
config = Config(
    sdk_key="sdk-xxx",
    stream=True,       # default, uses SSE streaming
    # stream=False,    # polling mode
    poll_interval=30,  # seconds (only used in polling mode)
)
# Check if SDK is initialized and connected
print(client.is_initialized()) # True = SDK got flags from LD
# False = using defaults only (network issue, wrong SDK key, SDK not ready)
If the SDK fails to connect, every variation() call returns the default value (the last argument). Design defaults to be the safe, conservative behavior.
Gotcha: Default Value Is Your Production Fallback¶
Rule: If the flag evaluation fails (SDK not initialized, network partition, LD outage), users get the default value. Design accordingly.
# BAD: default is True — if LD goes down, everyone gets the beta feature
if client.variation("experimental-payment-v2", user_ctx, True):
    return new_payment_flow()

# GOOD: default is False — if LD goes down, everyone gets the stable flow
if client.variation("experimental-payment-v2", user_ctx, False):
    return new_payment_flow()
return legacy_payment_flow()

# GOOD: default is True for a kill switch (default = enabled, flag disables it)
if client.variation("recommendations-enabled", user_ctx, True):
    return recommendation_engine.get(user_id)
return fallback_recommendations()
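To notice in production that defaults are being served, the LD Python SDK's `variation_detail` returns an evaluation reason alongside the value; when the SDK never initialized, the reason kind is `ERROR`. A sketch under that assumption (the wrapper and the stub client are ours; only `variation_detail` is the SDK API):

```python
from types import SimpleNamespace

def evaluate_or_alert(client, flag_key, ctx, default, alerts):
    """Evaluate via variation_detail and record when the default was served
    because of an error (e.g. SDK not initialized)."""
    detail = client.variation_detail(flag_key, ctx, default)
    if detail.reason.get("kind") == "ERROR":
        alerts.append((flag_key, detail.reason.get("errorKind")))
    return detail.value

# Stub standing in for an LD client that never initialized:
class UninitializedClient:
    def variation_detail(self, key, ctx, default):
        return SimpleNamespace(
            value=default,
            reason={"kind": "ERROR", "errorKind": "CLIENT_NOT_READY"},
        )

alerts = []
value = evaluate_or_alert(UninitializedClient(), "experimental-payment-v2",
                          {"key": "user-123"}, False, alerts)
# value is the safe default (False); alerts records the failure for paging
```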
Pattern: Safe Percentage Rollout¶
1% → monitor for 24h → no regressions? Continue
5% → monitor for 24h
10% → monitor for 48h (enough traffic to detect p99 issues)
25% → monitor for 24h
50% → monitor for 24h
100% → stable for 1 week → remove flag and old code
# What to monitor during rollout
# Error rate by flag variation (requires tagging your metrics with the variation)
# SELECT variation, count(*), avg(response_time), sum(errors)/count(*) as error_rate
# FROM requests
# WHERE flag_key = 'new-checkout'
# GROUP BY variation
# In Datadog/Grafana: split your SLO dashboards by flag variation
# Add flag evaluation as a tag to all metrics/traces
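Splitting dashboards by variation means emitting the variation as a metric tag at request time. A stdlib-only sketch of the bookkeeping the SQL above relies on (in practice these would be Datadog/StatsD tags, not an in-process dict; the class name is illustrative):

```python
from collections import defaultdict

class VariationMetrics:
    """Track request counts and errors per (flag, variation) tag — the same
    split the monitoring query above computes."""
    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, flag_key: str, variation: str, error: bool = False):
        tag = f"{flag_key}:{variation}"
        self.requests[tag] += 1
        if error:
            self.errors[tag] += 1

    def error_rate(self, flag_key: str, variation: str) -> float:
        tag = f"{flag_key}:{variation}"
        return self.errors[tag] / self.requests[tag] if self.requests[tag] else 0.0

m = VariationMetrics()
for _ in range(98):
    m.record("new-checkout", "treatment")
m.record("new-checkout", "treatment", error=True)
m.record("new-checkout", "treatment", error=True)
# error_rate("new-checkout", "treatment") is now 2/100 = 0.02
```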
Pattern: Emergency Kill Switch¶
# Design pattern: every expensive operation should have a kill switch flag
class RecommendationService:
    FLAG_KEY = "recommendations-enabled"
    FALLBACK_RECOMMENDATIONS: list = []  # in practice: precomputed popular items

    def get_recommendations(self, user_id: str) -> list:
        ctx = EvaluationContext(targeting_key=user_id)
        if not self.flag_client.get_boolean_value(self.FLAG_KEY, True, ctx):
            # Kill switch is off — degrade immediately, no logging spam
            return self.FALLBACK_RECOMMENDATIONS
        try:
            return self.engine.get(user_id, timeout=2.0)
        except TimeoutError:
            self.metrics.increment("recommendations.timeout")
            return self.FALLBACK_RECOMMENDATIONS
        except Exception as e:
            self.logger.error("recommendation engine error", exc_info=e)
            self.metrics.increment("recommendations.error")
            return self.FALLBACK_RECOMMENDATIONS
Incident response with a kill switch:
# LaunchDarkly API: turn off flag immediately (no UI needed)
curl -X PATCH \
-H "Authorization: $LD_API_KEY" \
-H "Content-Type: application/json" \
"https://app.launchdarkly.com/api/v2/flags/my-project/recommendations-enabled" \
-d '[{"op": "replace", "path": "/environments/production/on", "value": false}]'
# Verify it took effect
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project/recommendations-enabled" | \
jq '.environments.production.on'
Scenario: Debugging "Why Is User X Getting the Old Experience?"¶
# Step 1: Check what variation LD thinks they should get
# Easiest: the flag Debugger in the LD UI. A REST evaluation endpoint also
# exists, but its exact path and payload have changed across API versions,
# so verify the shape below against the current LD API docs before relying on it:
curl -s -X POST -H "Authorization: $LD_API_KEY" -H "Content-Type: application/json" \
"https://app.launchdarkly.com/api/v2/flags/my-project/new-checkout/evaluate" \
-d '{"user": {"key": "user-123", "custom": {"plan": "premium"}}}' | \
jq '{variation: .variation, reason: .reason}'
# Step 2: Check what the SDK is actually sending (add logging via a hook)
from openfeature.hook import Hook

class DebugHook(Hook):
    def before(self, hook_context, hints):
        print(f"Evaluating {hook_context.flag_key} for {hook_context.evaluation_context}")

    def after(self, hook_context, details, hints):
        print(f"Result: {details.value}, Reason: {details.reason}")
# Step 3: Check if targeting is correct in the UI
# - Individual targeting: is user-123 explicitly listed?
# - Rule matching: does a plan = "premium" rule exist, ordered above the default rule?
# - Percentage rollout: which bucket does user-123 hash into?
# - Is the flag ON for this environment?
# Step 4: Check that evaluation context attributes match rule expectations
# Common issue: rule says plan = "premium" but code sends plan = "Premium" (case mismatch)
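The Step 4 case mismatch can be prevented defensively by normalizing string attributes before building the evaluation context (LD string operators are case-sensitive by default; `normalized_attrs` is our helper name, not an SDK API):

```python
def normalized_attrs(attrs: dict) -> dict:
    """Lowercase string attribute values so a rule written against
    plan = "premium" still matches when callers send "Premium"."""
    return {k: v.lower() if isinstance(v, str) else v for k, v in attrs.items()}

ctx_attrs = normalized_attrs({"plan": "Premium", "beta_opt_in": True})
# {"plan": "premium", "beta_opt_in": True}
```

This only helps if the targeting rules themselves use lowercase values; the real fix is agreeing on one canonical casing and enforcing it where contexts are built.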
Scenario: Stale Flag Cleanup Audit¶
# Find candidate stale flags via the LD API: created over 90 days ago, or no maintainer
# (creationDate is epoch milliseconds; date -d is GNU date)
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project?limit=200" | \
jq --arg cutoff "$(date -d '90 days ago' +%s)000" \
'.items[] | select(.maintainerId == null or (.creationDate < ($cutoff | tonumber))) | {key: .key, created: (.creationDate / 1000 | floor | gmtime | strftime("%Y-%m-%d"))}'
# Find flags where 100% of traffic gets one variation (ready to remove)
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project?limit=200" | \
jq '.items[] | select(.environments.production.fallthrough.variation != null and (.environments.production.rules | length) == 0 and (.environments.production.targets | length) == 0) | {key: .key, variation: .environments.production.fallthrough.variation}'
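The same removal heuristic as the jq filter above, as a function you can run over a saved API response. A heuristic sketch: it ignores prerequisites and flag on/off state, and assumes the `production` environment shape shown:

```python
def removable_flags(flags_json: dict, env: str = "production") -> list:
    """A flag is a removal candidate when the environment has no rules,
    no individual targets, and a fixed fallthrough variation, i.e. every
    user receives the same variation."""
    candidates = []
    for flag in flags_json.get("items", []):
        e = flag.get("environments", {}).get(env, {})
        if (e.get("fallthrough", {}).get("variation") is not None
                and not e.get("rules")
                and not e.get("targets")):
            candidates.append(flag["key"])
    return candidates

sample = {"items": [
    {"key": "new-checkout", "environments": {"production": {
        "fallthrough": {"variation": 0}, "rules": [], "targets": []}}},
    {"key": "beta-search", "environments": {"production": {
        "fallthrough": {"rollout": {}}, "rules": [{"clauses": []}], "targets": []}}},
]}
# removable_flags(sample) returns ["new-checkout"]
```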
# Search codebase for flag references (using LD code references CLI)
ld-find-code-refs \
--accessToken="$LD_API_KEY" \
--projKey=my-project \
--dir=/path/to/repo \
--dryRun # show what would be reported
Emergency: SDK Not Initializing (All Users Get Default)¶
import logging
import os

from ldclient.client import LDClient
from ldclient.config import Config

logging.basicConfig(level=logging.DEBUG)  # enable SDK debug logs temporarily

config = Config(sdk_key=os.environ["LD_SDK_KEY"])
# Increase the init wait if startup is slow. start_wait is an argument to the
# LDClient constructor (not a Config field):
client = LDClient(config, start_wait=10)  # seconds to wait for first flag data
# Check if initialized
if not client.is_initialized():
    # Causes:
    # 1. Wrong SDK key (check environment — prod key used in dev?)
    # 2. Network blocked: can't reach app.launchdarkly.com (check proxy/firewall)
    # 3. SDK key for the wrong environment
    print("WARNING: LD not initialized — all flags returning defaults")
# Test connectivity manually
import urllib.request
try:
    urllib.request.urlopen("https://app.launchdarkly.com", timeout=5)
    print("LD reachable")
except Exception as e:
    print(f"LD NOT reachable: {e}")
    # -> Check proxy, DNS, firewall rules
Useful One-Liners¶
# List all flags for a project (names and on/off status in production)
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project" | \
jq -r '.items[] | "\(.key)\t\(.environments.production.on)"'
# Count evaluations by flag and variation (if you emit metrics)
kubectl logs -l app=my-service --since=1h | \
jq -r 'select(.msg == "flag_evaluated") | "\(.flag):\(.variation)"' | \
sort | uniq -c | sort -rn
# Quick flag toggle via curl (turn a flag ON in production)
FLAG=my-flag-key; ENV=production
curl -X PATCH -H "Authorization: $LD_API_KEY" -H "Content-Type: application/json" \
"https://app.launchdarkly.com/api/v2/flags/my-project/$FLAG" \
-d "[{\"op\": \"replace\", \"path\": \"/environments/$ENV/on\", \"value\": true}]"
# Check OpenFeature provider type in Go app logs
kubectl logs deployment/api-server | grep -i "provider\|openfeature" | head -5
# Find dead code paths by checking if a flag is always one value
grep -r "new-checkout" src/ | grep -v test | grep -v "\.pyc"
# If all references are guarded by the same variation, the other path is dead code
# Verify flag evaluation context is correct in requests
curl -s http://localhost:8000/api/checkout \
-H "X-User-ID: test-user-123" \
-H "X-User-Plan: premium" \
-v 2>&1 | grep -i "x-flag-\|variation\|feature"