
Pattern: Health Check Lying

ID: FP-024
Family: Cascading Failure
Frequency: Common
Blast Radius: Single Service
Detection Difficulty: Actively Misleading

The Shape

A service's health check endpoint returns 200 OK while the service is silently failing to perform its core function. The check verifies that the HTTP server is running, not that the business logic is correct. Traffic is routed to the service; requests fail silently; the load balancer never removes the instance because the health check keeps passing. This is the inverse of FP-012 (health check too strict); here the check is too permissive.

How You'll See It

In Kubernetes

Liveness probe: GET /health → returns 200 always. Service has a broken database connection (pool exhausted). All POST /order requests fail with 500. Health check path doesn't touch the database; it just returns {"status":"ok"}. Kubernetes considers the pod healthy; it stays in the load balancer rotation; users see 500s.
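The anti-pattern can be sketched in a few lines. This is a hedged illustration, not production code: the pool class, handler names, and error shape are hypothetical stand-ins for the scenario above.

```python
# Sketch of the anti-pattern: a health handler that never touches the
# dependency the write path needs. All names here are hypothetical.

class ExhaustedPool:
    """Stands in for a database connection pool with no free connections."""
    def acquire(self):
        raise TimeoutError("connection pool exhausted")

db_pool = ExhaustedPool()

def health():
    # Shallow check: proves the HTTP server can respond, nothing more.
    return 200, {"status": "ok"}

def create_order(order):
    # The real code path: needs a database connection, which it can't get.
    try:
        db_pool.acquire()
    except TimeoutError:
        return 500, {"error": "database unavailable"}
    return 201, {"id": order["id"]}  # never reached while the pool is down
```

With the pool exhausted, `health()` keeps answering 200 while every `create_order()` call returns 500 — exactly the split the probe never sees.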

In Linux/Infrastructure

HAProxy backend health check: HTTP GET /ping. Backend service has a race condition that corrupts 5% of write operations. /ping returns 200. HAProxy routes 100% of traffic to the broken backend. The corruption is invisible to monitoring.
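A typical HAProxy backend of this kind might look like the sketch below (addresses and names are made up). The check only asserts that /ping answers 200; nothing in it can notice corrupted writes.

```
backend app
    option httpchk GET /ping
    http-check expect status 200
    server app1 10.0.0.11:8080 check inter 2s fall 3 rise 2
```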

In CI/CD

A deployment pipeline "health check" runs curl -f http://service/health. The health check passes. But the service's feature flag is misconfigured, silently disabling key functionality for 50% of users. The health check doesn't validate feature flags.
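A deploy gate can verify more than "is it up". Here is a minimal sketch, assuming the service exposes its active feature flags in a status response; the flag names and the gate function are hypothetical.

```python
# Post-deploy gate: pass only if the service is up AND the feature flags
# the release depends on are actually enabled. Flag names are hypothetical.

EXPECTED_FLAGS = {"checkout_v2": True, "new_pricing": True}

def deploy_gate(health_status: int, active_flags: dict) -> bool:
    """Return True only when the shallow check and the functional check agree."""
    if health_status != 200:
        return False
    return all(active_flags.get(name) == value
               for name, value in EXPECTED_FLAGS.items())
```

A misconfigured flag now fails the pipeline even though `curl -f` against /health would have passed.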

The Tell

Health check returns 200, but request error rate is elevated. The health check endpoint doesn't exercise the code paths that are failing. Errors appear in application logs but the pod/instance is never removed from rotation.
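This tell can be encoded as an alert that fires precisely when the two signals disagree. A hedged PromQL sketch, assuming conventional metric names (`http_requests_total`, blackbox-exporter `probe_success`) and a shared `service` label — adjust to your own setup:

```
# Fires when a service reports healthy probes while its 5xx rate is elevated.
(
  sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
    /
  sum by (service) (rate(http_requests_total[5m]))
) > 0.01
and on (service) (probe_success == 1)
```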

Common Misdiagnosis

| Looks Like | But Actually | How to Tell the Difference |
| --- | --- | --- |
| User error or client bug | Server-side silent failure | Error rate elevated for all users, not just one; server logs confirm 5xx |
| Random error | Systematic but partial failure | Consistent error rate (e.g., 5%), not random spikes |
| Infrastructure issue | Application logic failing silently | Infrastructure (CPU, network, storage) healthy; specific code paths failing |

The Fix (Generic)

  1. Immediate: Add a meaningful health check that exercises the critical path (DB write + read round-trip).
  2. Short-term: Distinguish liveness (am I running?) from readiness (can I serve traffic?); add a deep readiness probe that validates the critical path.
  3. Long-term: Implement SLO-based health: if error rate exceeds 1% for 60s, the readiness probe returns 503 automatically; this triggers load balancer removal without any manual intervention.
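Steps 2 and 3 can be combined in one readiness handler: a deep write+read round-trip plus an SLO-driven sliding error-rate window. This is a minimal sketch, assuming a key-value store client with get/set semantics (a dict stands in here); class and key names are hypothetical.

```python
# Liveness stays shallow; readiness exercises the critical path and
# self-reports 503 when the recent error rate exceeds the budget.
import time
from collections import deque

class ReadinessProbe:
    def __init__(self, store, window_s=60, max_error_rate=0.01):
        self.store = store                  # any object with dict-like get/set
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.events = deque()               # (timestamp, was_error) pairs

    def record(self, was_error: bool, now=None):
        """Call from the request path after each response."""
        now = time.monotonic() if now is None else now
        self.events.append((now, was_error))
        cutoff = now - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def error_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(err for _, err in self.events) / len(self.events)

    def check(self) -> int:
        """HTTP status the readiness endpoint should serve."""
        # 1. Deep check: write+read round-trip on a dedicated probe key.
        try:
            self.store["_readiness_probe"] = "ok"
            if self.store["_readiness_probe"] != "ok":
                return 503
        except Exception:
            return 503
        # 2. SLO check: unready if the windowed error rate blows the budget.
        if self.error_rate() > self.max_error_rate:
            return 503
        return 200

def liveness() -> int:
    # Deliberately shallow: "is the process alive?" — restart-worthy only.
    return 200
```

Wiring `check()` to the readiness endpoint means a 5% failure rate in the write path flips the probe to 503 and the load balancer drains the instance without a human in the loop.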

Real-World Examples

  • Example 1: Redis connection pool exhausted. GET /health → 200 (no Redis dependency). All writes (which required Redis for session lock) returned 500. Service was "healthy" according to Kubernetes for 22 minutes until a human looked at application logs.
  • Example 2: Data corruption bug in writes (5% of records). Health check returned 200. No alerts fired. Bug was discovered 3 weeks later during a data audit. 21 days of corrupted records in production.

War Story

Users were reporting "my orders aren't saving." Health checks: all green. Metrics: 200 OK rate at 97%. The 3% failure rate was below our alert threshold (we had set it at 5% — seemed reasonable). The health check just did SELECT 1 from the database, not an actual write. The bug was in the write path. For 4 hours, 3% of orders were silently dropped. Our health check was a perfect liar: technically correct (the service was up), completely wrong (the service wasn't working). We rewrote the health check to do an actual write + read round-trip to a test key. Took 10ms but told the truth.

Cross-References

  • Topic Packs: k8s-ops, observability-deep-dive
  • Footguns: k8s-ops/footguns.md — "Ignoring service health vs system health"
  • Related Patterns: FP-012 (deep health check cascade — inverse: check too strict), FP-043 (percentile blindness — error rates hidden in averages)