Feature Flags — Street-Level Ops¶
Quick Diagnosis Commands¶
# Check LaunchDarkly SDK connection status (look at startup logs)
grep -i "launchdarkly\|feature.*flag\|LD_SDK" /var/log/app/app.log | tail -50
# Check if flag evaluation is happening (add metric/log to hook)
grep "feature_flag.evaluated" /var/log/app/app.log | jq -r '.flag' | sort | uniq -c | sort -rn
# LaunchDarkly REST API: check flag status
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project/new-checkout" | \
jq '{key: .key, on: .environments["production"].on, variations: .variations}'
# Flagsmith REST API: check flag for a specific user
curl -s -H "X-Environment-Key: $FLAGSMITH_ENV_KEY" \
"https://api.flagsmith.com/api/v1/identities/?identifier=user-123" | \
jq '.flags[] | select(.feature.name == "new-checkout") | {enabled: .enabled, value: .feature_state_value}'
# Check OpenFeature provider initialization in logs
kubectl logs deployment/my-service | grep -i "openfeature\|provider\|flag" | tail -20
# Emergency: see all evaluations for a specific flag (if you log them)
kubectl logs deployment/my-service --since=1h | \
jq -r 'select(.flag == "new-checkout") | "\(.user_id) \(.value) \(.reason)"' | \
sort | uniq -c | sort -rn
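The jq one-liners above assume each evaluation is logged as one JSON object with `flag`, `user_id`, `value`, and `reason` fields. A minimal stdlib-only sketch of that logging (field names chosen to match the jq filters; wire it into whatever evaluation hook your SDK offers):

```python
import json
import logging

logger = logging.getLogger("flags")

def log_evaluation(flag: str, user_id: str, value, reason: str) -> str:
    """Emit one JSON line per flag evaluation so the jq pipelines above
    (select(.flag == ...), counting by value/reason) can aggregate them."""
    line = json.dumps({
        "msg": "flag_evaluated",
        "flag": flag,
        "user_id": user_id,
        "value": value,
        "reason": reason,
    })
    logger.info(line)
    return line

# Example: log_evaluation("new-checkout", "user-123", True, "RULE_MATCH")
```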
Gotcha: Stale SDK Cache¶
Rule: The LaunchDarkly SDK caches flag state locally. After you change a flag in the UI, instances on the streaming connection (the default) pick up the change within a few seconds; instances in polling mode wait up to one poll interval (30s by default), so a mixed fleet can take a minute or more to fully converge.
# Check SDK connection mode
client = ldclient.get()
# Streaming (real-time, default): changes propagate in ~1 second
# Polling (fallback): changes propagate every poll_interval (default 30s)
config = Config(
    sdk_key="sdk-xxx",
    stream=True,       # default, uses SSE streaming
    # stream=False,    # polling mode
    poll_interval=30,  # seconds (only used in polling mode)
)
# Check if SDK is initialized and connected
print(client.is_initialized()) # True = SDK got flags from LD
# False = using defaults only (network issue, wrong SDK key, SDK not ready)
If the SDK fails to connect, every variation() call returns the default value (the last argument). Design defaults to be the safe, conservative behavior.
Gotcha: Default Value Is Your Production Fallback¶
Rule: If the flag evaluation fails (SDK not initialized, network partition, LD outage), users get the default value. Design accordingly.
# BAD: default is True — if LD goes down, everyone gets the beta feature
if client.variation("experimental-payment-v2", user_ctx, True):
    return new_payment_flow()

# GOOD: default is False — if LD goes down, everyone gets the stable flow
if client.variation("experimental-payment-v2", user_ctx, False):
    return new_payment_flow()
return legacy_payment_flow()

# GOOD: default is True for a kill switch (default = enabled, flag disables it)
if client.variation("recommendations-enabled", user_ctx, True):
    return recommendation_engine.get(user_id)
return fallback_recommendations()
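To notice in production that defaults are being served, the LD Python SDK's `variation_detail` returns an evaluation reason alongside the value; when the SDK never initialized, the reason kind is `ERROR`. A sketch under that assumption (the wrapper and the stub client are ours; only `variation_detail` is the SDK API):

```python
from types import SimpleNamespace

def evaluate_or_alert(client, flag_key, ctx, default, alerts):
    """Evaluate via variation_detail and record when the default was served
    because of an error (e.g. SDK not initialized)."""
    detail = client.variation_detail(flag_key, ctx, default)
    if detail.reason.get("kind") == "ERROR":
        alerts.append((flag_key, detail.reason.get("errorKind")))
    return detail.value

# Stub standing in for an LD client that never initialized:
class UninitializedClient:
    def variation_detail(self, key, ctx, default):
        return SimpleNamespace(
            value=default,
            reason={"kind": "ERROR", "errorKind": "CLIENT_NOT_READY"},
        )

alerts = []
value = evaluate_or_alert(UninitializedClient(), "experimental-payment-v2",
                          {"key": "user-123"}, False, alerts)
# value is the safe default (False); alerts records the failure for paging
```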
Pattern: Safe Percentage Rollout¶
1% → monitor for 24h → no regressions? Continue
5% → monitor for 24h
10% → monitor for 48h (enough traffic to detect p99 issues)
25% → monitor for 24h
50% → monitor for 24h
100% → stable for 1 week → remove flag and old code
# What to monitor during rollout
# Error rate by flag variation (requires tagging your metrics with the variation)
# SELECT variation, count(*), avg(response_time), sum(errors)/count(*) as error_rate
# FROM requests
# WHERE flag_key = 'new-checkout'
# GROUP BY variation
# In Datadog/Grafana: split your SLO dashboards by flag variation
# Add flag evaluation as a tag to all metrics/traces
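Splitting dashboards by variation means emitting the variation as a metric tag at request time. A stdlib-only sketch of the bookkeeping the SQL above relies on (in practice these would be Datadog/StatsD tags, not an in-process dict; the class name is illustrative):

```python
from collections import defaultdict

class VariationMetrics:
    """Track request counts and errors per (flag, variation) tag — the same
    split the monitoring query above computes."""
    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, flag_key: str, variation: str, error: bool = False):
        tag = f"{flag_key}:{variation}"
        self.requests[tag] += 1
        if error:
            self.errors[tag] += 1

    def error_rate(self, flag_key: str, variation: str) -> float:
        tag = f"{flag_key}:{variation}"
        return self.errors[tag] / self.requests[tag] if self.requests[tag] else 0.0

m = VariationMetrics()
for _ in range(98):
    m.record("new-checkout", "treatment")
m.record("new-checkout", "treatment", error=True)
m.record("new-checkout", "treatment", error=True)
# error_rate("new-checkout", "treatment") is now 2/100 = 0.02
```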
Pattern: Emergency Kill Switch¶
# Design pattern: every expensive operation should have a kill switch flag
class RecommendationService:
    FLAG_KEY = "recommendations-enabled"
    FALLBACK_RECOMMENDATIONS: list = []  # in practice: precomputed popular items

    def get_recommendations(self, user_id: str) -> list:
        ctx = EvaluationContext(targeting_key=user_id)
        if not self.flag_client.get_boolean_value(self.FLAG_KEY, True, ctx):
            # Kill switch is off — degrade immediately, no logging spam
            return self.FALLBACK_RECOMMENDATIONS
        try:
            return self.engine.get(user_id, timeout=2.0)
        except TimeoutError:
            self.metrics.increment("recommendations.timeout")
            return self.FALLBACK_RECOMMENDATIONS
        except Exception as e:
            self.logger.error("recommendation engine error", exc_info=e)
            self.metrics.increment("recommendations.error")
            return self.FALLBACK_RECOMMENDATIONS
Incident response with a kill switch:
# LaunchDarkly API: turn off flag immediately (no UI needed)
curl -X PATCH \
-H "Authorization: $LD_API_KEY" \
-H "Content-Type: application/json" \
"https://app.launchdarkly.com/api/v2/flags/my-project/recommendations-enabled" \
-d '[{"op": "replace", "path": "/environments/production/on", "value": false}]'
# Verify it took effect
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project/recommendations-enabled" | \
jq '.environments.production.on'
Scenario: Debugging "Why Is User X Getting the Old Experience?"¶
# Step 1: Check what variation LD thinks they should get
# Easiest: the flag Debugger in the LD UI. A REST evaluation endpoint also
# exists, but its exact path and payload have changed across API versions,
# so verify the shape below against the current LD API docs before relying on it:
curl -s -X POST -H "Authorization: $LD_API_KEY" -H "Content-Type: application/json" \
"https://app.launchdarkly.com/api/v2/flags/my-project/new-checkout/evaluate" \
-d '{"user": {"key": "user-123", "custom": {"plan": "premium"}}}' | \
jq '{variation: .variation, reason: .reason}'
# Step 2: Check what the SDK is actually sending (add logging via a hook)
from openfeature.hook import Hook

class DebugHook(Hook):
    def before(self, hook_context, hints):
        print(f"Evaluating {hook_context.flag_key} for {hook_context.evaluation_context}")

    def after(self, hook_context, details, hints):
        print(f"Result: {details.value}, Reason: {details.reason}")
# Step 3: Check if targeting is correct in the UI
# - Individual targeting: is user-123 explicitly listed?
# - Rule matching: does a plan = "premium" rule exist, ordered above the default rule?
# - Percentage rollout: which bucket does user-123 hash into?
# - Is the flag ON for this environment?
# Step 4: Check that evaluation context attributes match rule expectations
# Common issue: rule says plan = "premium" but code sends plan = "Premium" (case mismatch)
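The Step 4 case mismatch can be prevented defensively by normalizing string attributes before building the evaluation context (LD string operators are case-sensitive by default; `normalized_attrs` is our helper name, not an SDK API):

```python
def normalized_attrs(attrs: dict) -> dict:
    """Lowercase string attribute values so a rule written against
    plan = "premium" still matches when callers send "Premium"."""
    return {k: v.lower() if isinstance(v, str) else v for k, v in attrs.items()}

ctx_attrs = normalized_attrs({"plan": "Premium", "beta_opt_in": True})
# {"plan": "premium", "beta_opt_in": True}
```

This only helps if the targeting rules themselves use lowercase values; the real fix is agreeing on one canonical casing and enforcing it where contexts are built.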
Scenario: Stale Flag Cleanup Audit¶
# Find candidate stale flags via the LD API: created over 90 days ago, or no maintainer
# (creationDate is epoch milliseconds; date -d is GNU date)
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project?limit=200" | \
jq --arg cutoff "$(date -d '90 days ago' +%s)000" \
'.items[] | select(.maintainerId == null or (.creationDate < ($cutoff | tonumber))) | {key: .key, created: (.creationDate / 1000 | floor | gmtime | strftime("%Y-%m-%d"))}'
# Find flags where 100% of traffic gets one variation (ready to remove)
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project?limit=200" | \
jq '.items[] | select(.environments.production.fallthrough.variation != null and (.environments.production.rules | length) == 0 and (.environments.production.targets | length) == 0) | {key: .key, variation: .environments.production.fallthrough.variation}'
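The same removal heuristic as the jq filter above, as a function you can run over a saved API response. A heuristic sketch: it ignores prerequisites and flag on/off state, and assumes the `production` environment shape shown:

```python
def removable_flags(flags_json: dict, env: str = "production") -> list:
    """A flag is a removal candidate when the environment has no rules,
    no individual targets, and a fixed fallthrough variation, i.e. every
    user receives the same variation."""
    candidates = []
    for flag in flags_json.get("items", []):
        e = flag.get("environments", {}).get(env, {})
        if (e.get("fallthrough", {}).get("variation") is not None
                and not e.get("rules")
                and not e.get("targets")):
            candidates.append(flag["key"])
    return candidates

sample = {"items": [
    {"key": "new-checkout", "environments": {"production": {
        "fallthrough": {"variation": 0}, "rules": [], "targets": []}}},
    {"key": "beta-search", "environments": {"production": {
        "fallthrough": {"rollout": {}}, "rules": [{"clauses": []}], "targets": []}}},
]}
# removable_flags(sample) returns ["new-checkout"]
```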
# Search codebase for flag references (using LD code references CLI)
ld-find-code-refs \
--accessToken="$LD_API_KEY" \
--projKey=my-project \
--dir=/path/to/repo \
--dryRun # show what would be reported
Emergency: SDK Not Initializing (All Users Get Default)¶
import logging
import os

from ldclient.client import LDClient
from ldclient.config import Config

logging.basicConfig(level=logging.DEBUG)  # enable SDK debug logs temporarily

config = Config(sdk_key=os.environ["LD_SDK_KEY"])
# Increase the init wait if startup is slow. start_wait is an argument to the
# LDClient constructor (not a Config field):
client = LDClient(config, start_wait=10)  # seconds to wait for first flag data
# Check if initialized
if not client.is_initialized():
    # Causes:
    # 1. Wrong SDK key (check environment — prod key used in dev?)
    # 2. Network blocked: can't reach app.launchdarkly.com (check proxy/firewall)
    # 3. SDK key for the wrong environment
    print("WARNING: LD not initialized — all flags returning defaults")
# Test connectivity manually
import urllib.request
try:
    urllib.request.urlopen("https://app.launchdarkly.com", timeout=5)
    print("LD reachable")
except Exception as e:
    print(f"LD NOT reachable: {e}")
    # -> Check proxy, DNS, firewall rules
Useful One-Liners¶
# List all flags for a project (names and on/off status in production)
curl -s -H "Authorization: $LD_API_KEY" \
"https://app.launchdarkly.com/api/v2/flags/my-project" | \
jq -r '.items[] | "\(.key)\t\(.environments.production.on)"'
# Count evaluations by flag and variation (if you emit metrics)
kubectl logs -l app=my-service --since=1h | \
jq -r 'select(.msg == "flag_evaluated") | "\(.flag):\(.variation)"' | \
sort | uniq -c | sort -rn
# Quick flag toggle via curl (turn a flag ON in production)
FLAG=my-flag-key; ENV=production
curl -X PATCH -H "Authorization: $LD_API_KEY" -H "Content-Type: application/json" \
"https://app.launchdarkly.com/api/v2/flags/my-project/$FLAG" \
-d "[{\"op\": \"replace\", \"path\": \"/environments/$ENV/on\", \"value\": true}]"
# Check OpenFeature provider type in Go app logs
kubectl logs deployment/api-server | grep -i "provider\|openfeature" | head -5
# Find dead code paths by checking if a flag is always one value
grep -r "new-checkout" src/ | grep -v test | grep -v "\.pyc"
# If all references are guarded by the same variation, the other path is dead code
# Verify flag evaluation context is correct in requests
curl -s http://localhost:8000/api/checkout \
-H "X-User-ID: test-user-123" \
-H "X-User-Plan: premium" \
-v 2>&1 | grep -i "x-flag-\|variation\|feature"