Pattern: Deep Health Check Cascade¶
ID: FP-012 Family: Thundering Herd Frequency: Common Blast Radius: Single Service to Multi-Service Detection Difficulty: Subtle
The Shape¶
A readiness probe checks not just whether the service itself is alive, but also whether its downstream dependencies are reachable (database ping, cache check, external API call). When one downstream dependency becomes slow or unavailable, every pod's readiness probe fails simultaneously. The entire service is removed from the load balancer, converting a partial degradation (one slow dependency) into a complete service outage.
How You'll See It¶
In Kubernetes¶
/health/ready endpoint checks Redis AND Postgres AND external payment API. Postgres
is temporarily slow (high I/O from FP-008). All 20 pods' readiness probes time out;
Kubernetes removes all pod endpoints from the Service. Traffic returns 503 from the
load balancer. The partial slowdown (some queries slow) becomes a total blackout.
In Linux/Infrastructure¶
An HAProxy backend health check runs the application's internal health script, which checks all dependencies. One NFS mount becomes slow; health check times out; HAProxy marks all backends as down. The NFS slowness (affecting only NFS-backed operations) becomes a complete service outage.
In CI/CD¶
A CI pipeline has an "integration test" step that checks connectivity to all external services. One service (staging SMTP) is unreachable. The entire CI pipeline fails at the health check step, blocking deploys of unrelated services.
The Tell¶
All instances fail the health check at the same time. The health check endpoint itself is querying a downstream dependency. The failing dependency is not critical to the core service function — but the health check treats it as required.
Common Misdiagnosis¶
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Service is down | Dependency health check failing | Service pods are running; logs show they were removed due to readiness failure, not crash |
| Network partition | Deep health check cascade | Specific dependency check failing; direct network to service is fine |
| Database failure | Database slowness triggers health check | Database is slow but responding; queries eventually complete |
The Fix (Generic)¶
- Immediate: Modify readiness probe to check only the service itself (local HTTP ping); remove dependency checks from readiness probes.
- Short-term: Separate "liveness" (am I alive?) from "readiness" (should I receive traffic?) from "dependency health" (are my dependencies healthy?); expose dependency health as a non-probe metric.
- Long-term: Design readiness probes to check only the service's ability to handle requests (local state), not external dependencies; use Kubernetes startup probes for initial dependency checks.
Real-World Examples¶
- Example 1: Payment service readiness probe:
GET /health→ checks Stripe API. Stripe had a brief 30s degradation. All 15 payment service pods failed readiness. Checkout was completely down for 90s. - Example 2: API gateway readiness checked all 8 downstream microservices. One service had a slow deploy. Gateway readiness failed; all API traffic returned 503 for 4 minutes during a 2-minute downstream deploy.
War Story¶
We were proud of our "comprehensive" health checks. Our readiness probe checked the DB, Redis, the internal config service, and the external auth provider. One day the config service had a 60-second blip. All 30 of our API pods went unready simultaneously. Every service behind our API went down. The config service came back; our pods became ready again. Total impact: 90 seconds of total blackout from a 60-second blip in a non-critical service. We immediately stripped the readiness probe to just "can I respond to HTTP?" and moved the dependency checks to a separate non-blocking metrics endpoint.
Cross-References¶
- Topic Packs: k8s-ops, distributed-systems
- Footguns: k8s-ops/footguns.md — "Deep health checks on non-critical dependencies"
- Related Patterns: FP-024 (health check lying — the inverse: health check says healthy when it isn't), FP-022 (dependency chain collapse — structural cousin)