Pattern: Dependency Chain Collapse

ID: FP-022 · Family: Cascading Failure · Frequency: Common · Blast Radius: Multi-Service to Cluster-Wide · Detection Difficulty: Moderate

The Shape

Service A depends on Service B, which depends on Service C. When Service C fails, B fails, then A fails. The failure propagates upward through the dependency chain, and the visible symptom (A is down) is far removed from the root cause (C failed). Without distributed tracing or good logging at each tier, operators debug the wrong service. The depth of the chain determines how misleading the symptom is.
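The shape above can be modeled in a few lines. This is a minimal sketch with a hypothetical A → B → C topology: a service is effectively down if it failed itself or if any of its dependencies is down, so a failure at the bottom surfaces as a symptom at the top.

```python
# Hypothetical three-tier dependency chain: A -> B -> C.
DEPENDENCIES = {
    "A": ["B"],
    "B": ["C"],
    "C": [],
}

def is_up(service, failed, deps=DEPENDENCIES):
    """A service is effectively down if it failed itself
    or if any of its dependencies is down."""
    if service in failed:
        return False
    return all(is_up(d, failed, deps) for d in deps[service])

# C fails at the bottom; the visible symptom is that A is down.
print(is_up("A", failed={"C"}))   # -> False, even though A itself never failed
print(is_up("C", failed={"C"}))   # -> False: the actual root cause
```

The deeper the chain, the more hops separate the alert ("A is down") from the node that actually failed.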

How You'll See It

In Kubernetes

User-facing API → auth service → LDAP server. LDAP server has a network partition. Auth service times out. API returns 401 to all users. On-call sees "authentication is down" in alerts. Checks auth service: it's running, CPU low, no crashes. Checks API gateway: healthy. Spends 20 minutes before checking LDAP — a 3rd-tier dependency.

In Linux/Infrastructure

Web → App → Database → Shared NFS (for session files). NFS becomes unresponsive. Database queries that need session data stall. App server connections fill. Web server 500s. Monitoring shows "web server errors" and "database slow" but the root cause is the NFS mount that no one is watching.

In CI/CD

Build → artifact cache → object storage. Object storage latency spike causes artifact cache (S3-backed) to time out. Build step "download dependencies" fails. CI shows "dependency download failed" — not "S3 is slow." Team debugs the cache service for 30 minutes before checking the underlying storage.

The Tell

The failure appears at the top of the call stack (user-facing service). Each tier's logs show "dependency unavailable" or timeout errors. The root cause service is at the bottom of the chain: the one that has no outgoing dependency calls, just incoming ones. Distributed traces (if present) show the entire chain failing from the bottom up.
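The tell suggests a mechanical search: starting from the symptomatic service, follow failing outgoing calls downward until you reach a service with none. A sketch, assuming a hypothetical topology and a `failing_call` predicate that in practice would come from traces or per-dependency metrics:

```python
# Hypothetical topology: api -> auth -> ldap.
DEPENDENCIES = {"api": ["auth"], "auth": ["ldap"], "ldap": []}

def find_root_cause(start, failing_call):
    """failing_call(src, dst) -> True if calls from src to dst are failing.
    Walks downward from `start` until a service has no failing outgoing
    calls -- that service is the likely root cause."""
    current = start
    while True:
        failing_deps = [d for d in DEPENDENCIES[current]
                        if failing_call(current, d)]
        if not failing_deps:
            return current          # nothing below it is failing: root cause
        current = failing_deps[0]   # follow the failure one tier down

# Example: LDAP is partitioned, so both upstream call paths time out.
broken = {("api", "auth"), ("auth", "ldap")}
print(find_root_cause("api", lambda s, d: (s, d) in broken))  # -> ldap
```

This is exactly what a distributed trace gives you for free: the bottom-most failing span in the trace is the root cause.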

Common Misdiagnosis

Looks Like                            | But Actually                             | How to Tell the Difference
Top-tier service is down              | Top-tier depends on a failed lower tier  | Top-tier CPU/memory fine; its outgoing calls are timing out
Multiple services down simultaneously | Chain collapse propagating upward        | Failures have a temporal ordering: bottom tier first, top tier last
Network failure                       | Specific dependency path failure         | Only the call paths through the failed dependency are affected
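The temporal-ordering check in the second row can be automated against alert timestamps. A sketch with invented first-failure times for the NFS scenario: in a chain collapse, failure times increase monotonically from the bottom tier to the top.

```python
# Hypothetical first-failure timestamps (epoch seconds) per service.
first_failure = {
    "nfs": 1_700_000_000,    # bottom tier fails first
    "db":  1_700_000_012,
    "app": 1_700_000_020,
    "web": 1_700_000_031,    # top tier fails last
}

chain_bottom_to_top = ["nfs", "db", "app", "web"]

def looks_like_chain_collapse(chain, times):
    """True if failure times increase monotonically from bottom to top."""
    ts = [times[s] for s in chain]
    return all(a <= b for a, b in zip(ts, ts[1:]))

print(looks_like_chain_collapse(chain_bottom_to_top, first_failure))  # -> True
```

Independent simultaneous failures (a network partition, a bad deploy) show no such ordering; the timestamps cluster together regardless of tier.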

The Fix (Generic)

  1. Immediate: Follow the dependency chain downward: check each tier's outgoing calls until you find one that is failing but has no failing outgoing calls — that's the root cause.
  2. Short-term: Add circuit breakers at each tier to prevent upstream saturation when downstream is down.
  3. Long-term: Implement distributed tracing (OpenTelemetry); map your service dependency graph; add dedicated alerts for each tier's dependency health (not just the top-tier symptoms).
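Step 2's circuit breaker can be sketched as follows. This is an illustration, not a production implementation (no half-open probing, no thread safety): after `threshold` consecutive failures the breaker opens and calls fail fast for `reset_after` seconds instead of piling up waiting on a dead dependency.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive
    failures, fail fast for `reset_after` seconds, then retry."""

    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cooldown elapsed: allow a retry
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # any success resets the count
        return result
```

Wrapped around each tier's outgoing dependency calls, this stops the upward saturation: when C dies, B fails fast instead of exhausting its thread pool or connection pool waiting on timeouts, so B's own callers see errors immediately rather than cascading stalls.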

Real-World Examples

  • Example 1: E-commerce: Checkout (A) → Inventory (B) → Warehouse API (C, external). Warehouse API returned errors. Inventory service logged warnings but propagated the error. Checkout returned 500. On-call debugged Checkout for 45 minutes before finding Warehouse API in the logs.
  • Example 2: Kubernetes cluster: CoreDNS (A) → etcd (B) → NFS backing etcd (C, via persistent volumes). NFS became unresponsive. etcd couldn't flush its WAL. CoreDNS lost its backend. All pod DNS failed. "CoreDNS is down" tickets — but the root cause was NFS.

War Story

Page at 3am: "all users getting 500 errors." Checked the API gateway: running fine, logs showed database timeouts. Checked the database: running fine, logs showed filesystem errors. Checked the filesystem: NFS mount stalled waiting for the file server. Checked the file server: out of disk space (FP-003). Total time from page to root cause: 47 minutes of following the chain. With distributed tracing, a single trace would have shown NFS latency at the bottom. We added dependency health metrics for every tier the next week.

Cross-References