Failure Pattern Catalog
A taxonomy of recurring failure shapes across production systems. Each pattern describes a "shape" you'll see again and again; once you can recognize the shape, diagnosis gets faster.
How to use this catalog: When debugging, match the shape of what you're seeing to a pattern family. The Tell in each pattern is the one signal that confirms "yes, this is pattern X, not pattern Y."
Pattern Index
| ID | Pattern | Family | Domains | Frequency |
|---|---|---|---|---|
| FP-001 | Inode Exhaustion | Resource Exhaustion | linux, storage | Common |
| FP-002 | Connection Pool Exhaustion | Resource Exhaustion | databases, k8s | Very Common |
| FP-003 | Disk Full (Reserved Blocks Gone) | Resource Exhaustion | linux | Common |
| FP-004 | OOM Without Swap Buffer | Resource Exhaustion | linux, k8s | Common |
| FP-005 | Cgroup Soft/Hard Limit Confusion | Resource Exhaustion | linux, k8s | Common |
| FP-006 | PID Exhaustion via Zombies | Resource Exhaustion | linux, cicd | Uncommon |
| FP-007 | tmpfs Consuming Hidden RAM | Resource Exhaustion | linux | Uncommon |
| FP-008 | RAID Rebuild I/O Saturation | Resource Exhaustion | storage, datacenter | Common |
| FP-009 | Retry Storm | Thundering Herd | distributed, k8s | Very Common |
| FP-010 | Cache Stampede | Thundering Herd | distributed, databases | Common |
| FP-011 | Restart Avalanche | Thundering Herd | k8s | Common |
| FP-012 | Deep Health Check Cascade | Thundering Herd | k8s, distributed | Common |
| FP-013 | Simultaneous Timer Expiry | Thundering Herd | distributed | Uncommon |
| FP-014 | Two-Node Quorum Trap | Split Brain | distributed, k8s | Common |
| FP-015 | Stale Leader | Split Brain | distributed | Common |
| FP-016 | Dual-Write Divergence | Split Brain | databases, distributed | Common |
| FP-017 | Clock Skew Ordering | Split Brain | distributed, datacenter | Common |
| FP-018 | Timeout Assumed = Not Executed | Split Brain | distributed | Very Common |
| FP-019 | No Circuit Breaker | Cascading Failure | distributed, k8s | Very Common |
| FP-020 | Missing Backpressure | Cascading Failure | distributed | Common |
| FP-021 | Retry Amplification | Cascading Failure | distributed | Common |
| FP-022 | Dependency Chain Collapse | Cascading Failure | distributed, k8s | Common |
| FP-023 | Thread Pool Exhaustion | Cascading Failure | distributed | Common |
| FP-024 | Health Check Lying | Cascading Failure | k8s, distributed | Common |
| FP-025 | Untested Backup | Silent Corruption | databases, storage | Very Common |
| FP-026 | Replication Lag at Failover | Silent Corruption | databases | Common |
| FP-027 | Missing Point-in-Time Recovery | Silent Corruption | databases | Common |
| FP-028 | Zombie Process Accumulation | Silent Corruption | linux, cicd | Uncommon |
| FP-029 | Deleted-But-Open File | Silent Corruption | linux | Common |
| FP-030 | Transaction ID Wraparound | Silent Corruption | databases | Uncommon |
| FP-031 | Stale Image Tag | Silent Corruption | k8s, cicd | Common |
| FP-032 | Rollout Hang (Zero Surge + Zero Unavailable) | Configuration Landmine | k8s | Common |
| FP-033 | `latest` Tag in Production | Configuration Landmine | k8s, docker | Very Common |
| FP-034 | Hardcoded Namespace Override | Configuration Landmine | k8s | Common |
| FP-035 | Memory Limit Equals Request | Configuration Landmine | k8s | Common |
| FP-036 | ndots:5 Query Amplification | Configuration Landmine | k8s, networking | Common |
| FP-037 | StatefulSet OrderedReady Deadlock | Configuration Landmine | k8s | Uncommon |
| FP-038 | PVC Reclaim Policy Delete | Configuration Landmine | k8s, storage | Common |
| FP-039 | STP Disabled + Loop Created | Configuration Landmine | networking, datacenter | Uncommon |
| FP-040 | Metric Cardinality Explosion | Observability Gap | observability | Common |
| FP-041 | Alerting on Restart (not Root Cause) | Observability Gap | k8s, observability | Very Common |
| FP-042 | Missing `absent()` Alert | Observability Gap | observability | Common |
| FP-043 | Percentile Blindness | Observability Gap | observability | Very Common |
| FP-044 | `rate()` Over Too-Short Window | Observability Gap | observability | Common |
| FP-045 | Unstructured Logging | Observability Gap | observability | Very Common |
| FP-046 | Wrong Terminal Tab | Human Error Amplifier | databases, linux | Very Common |
| FP-047 | Apply-Without-Reading Manifest | Human Error Amplifier | k8s | Common |
| FP-048 | Device Name Confusion | Human Error Amplifier | linux, storage | Common |
| FP-049 | Port-Forward as Permanent Fix | Human Error Amplifier | k8s | Common |
| FP-050 | Runbook with No Contacts | Human Error Amplifier | incident-command | Very Common |
| FP-051 | Missing Escalation Criteria | Human Error Amplifier | incident-command | Very Common |
| FP-052 | Untested Rollback Procedure | Human Error Amplifier | cicd, k8s | Common |
Pattern Families
| Family | Count | Core Idea |
|---|---|---|
| Resource Exhaustion | 8 | A finite resource runs out; the system's failure mode is worse than the scarcity |
| Thundering Herd | 5 | Many actors attempt the same action simultaneously; the spike overwhelms the target |
| Split Brain | 5 | Two nodes believe they are authoritative; writes diverge or operations repeat |
| Cascading Failure | 6 | One component's degradation causes adjacent components to fail in a chain |
| Silent Corruption | 7 | Data, state, or behavior degrade without triggering any alert |
| Configuration Landmine | 8 | A default or setting is safe under normal conditions but catastrophic at an edge case |
| Observability Gap | 6 | Monitoring is present but systematically blind to the actual failure |
| Human Error Amplifier | 7 | A process or environment design amplifies small human errors into large incidents |
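The family counts can be cross-checked against the Pattern Index, and a pattern ID can be mapped back to its family. The following is a minimal sketch, assuming (as the index above suggests) that each family occupies one contiguous block of IDs. The names `FAMILIES`, `check_families`, and `family_of` are hypothetical helpers for illustration, not part of the catalog.

```python
# Family -> (first ID, last ID, count stated in the Pattern Families table).
# ID ranges are read off the Pattern Index; contiguity is an assumption.
FAMILIES = {
    "Resource Exhaustion": (1, 8, 8),
    "Thundering Herd": (9, 13, 5),
    "Split Brain": (14, 18, 5),
    "Cascading Failure": (19, 24, 6),
    "Silent Corruption": (25, 31, 7),
    "Configuration Landmine": (32, 39, 8),
    "Observability Gap": (40, 45, 6),
    "Human Error Amplifier": (46, 52, 7),
}


def check_families(families):
    """Verify ranges are gap-free and match the stated counts; return the total."""
    expected_first = 1
    total = 0
    for name, (first, last, stated) in families.items():
        if first != expected_first:
            raise ValueError(f"{name}: expected range to start at {expected_first}")
        if last - first + 1 != stated:
            raise ValueError(f"{name}: range size does not match stated count")
        total += stated
        expected_first = last + 1
    return total


def family_of(pattern_id):
    """Map an ID such as 'FP-019' to its family name."""
    number = int(pattern_id.split("-")[1])
    for name, (first, last, _) in families_items():
        if first <= number <= last:
            return name
    raise KeyError(pattern_id)


def families_items():
    # Small indirection so family_of always reads the current table.
    return FAMILIES.items()


print(check_families(FAMILIES))  # total number of patterns
print(family_of("FP-019"))
```

A check like this is useful when a pattern is added or moved between families, since the index and the family table are maintained separately.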
Reading Order
New to production debugging? Start here:

1. FP-002 Connection Pool Exhaustion: the most common "why is everything slow?" root cause
2. FP-009 Retry Storm: the most common way a recovery makes things worse
3. FP-019 No Circuit Breaker: the structural gap that turns blips into outages
4. FP-025 Untested Backup: the disaster you only discover during the disaster
5. FP-046 Wrong Terminal Tab: the human error you will make; learn to pre-empt it
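To make the FP-009 entry concrete, here is a rough simulation of why synchronized retries can make a recovery worse. It is a sketch under stated assumptions, not the catalog's own definition: the client count, retry count, delays, and the jittered-backoff comparison are all hypothetical choices for illustration, and the function names are invented.

```python
import random
from collections import Counter


def peak_load(num_clients, retries, delay_fn, seed=0):
    """Largest number of requests landing in any single 100 ms bucket."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    buckets = Counter()
    for _ in range(num_clients):
        t = 0.0
        for attempt in range(retries):
            t += delay_fn(attempt, rng)
            buckets[int(t * 10)] += 1  # 100 ms resolution
    return max(buckets.values())


def fixed_delay(attempt, rng):
    # Every client waits exactly one second, so all retries arrive together.
    return 1.0


def jittered_backoff(attempt, rng, base=1.0, cap=30.0):
    # Random delay in [0, min(cap, base * 2**attempt)], often called "full jitter".
    return rng.uniform(0.0, min(cap, base * 2 ** attempt))


lockstep = peak_load(1000, 5, fixed_delay)
spread = peak_load(1000, 5, jittered_backoff)
print("peak requests per 100 ms, lockstep:", lockstep)
print("peak requests per 100 ms, jittered:", spread)
```

With a fixed delay, all 1000 hypothetical clients hit the recovering service in the same instant on every attempt; randomizing the delay spreads the same number of requests over time, which lowers the peak.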
Cross-Reference with Case Studies
Many of these patterns appear in ../case-studies/. Key mappings:
| Pattern | Case Study |
|---|---|
| FP-001 Inode Exhaustion | linux_ops/inode-exhaustion/ |
| FP-006 PID Exhaustion via Zombies | linux_ops/zombie-processes-accumulating/ |
| FP-014 Two-Node Quorum Trap | ops-archaeology/14-split-brain-etcd/ |
| FP-017 Clock Skew Ordering | cross-domain/hpa-flapping-clock-skew-ntp/ |
| FP-018 Timeout Assumed = Not Executed | ops-archaeology/04-postgres-replica-lag/ |
| FP-040 Metric Cardinality Explosion | cross-domain/grafana-empty-prometheus-networkpolicy/ |
| FP-043 Percentile Blindness | ops-archaeology/09-monitoring-gap/ |
Pages that link here
- Pattern: Alerting on Restart (Not Root Cause)
- Pattern: Apply-Without-Reading Manifest
- Pattern: Cache Stampede
- Pattern: Cgroup Soft/Hard Limit Confusion
- Pattern: Clock Skew Ordering
- Pattern: Connection Pool Exhaustion
- Pattern: Deep Health Check Cascade
- Pattern: Deleted-But-Open File
- Pattern: Device Name Confusion
- Pattern: Disk Full (Reserved Blocks Gone)
- Pattern: Dual-Write Divergence
- Pattern: Hardcoded Namespace Override
- Pattern: Health Check Lying
- Pattern: Inode Exhaustion
- Pattern: Memory Limit Equals Request