Failure Pattern Catalog

A taxonomy of recurring failure shapes across production systems. Each pattern describes a "shape" you'll see again and again — learn to recognize the shape and you'll diagnose faster every time.

How to use this catalog: When debugging, match the shape of what you're seeing to a pattern family. The Tell in each pattern is the one signal that confirms "yes, this is pattern X, not pattern Y."

Pattern Index

| ID | Pattern | Family | Domains | Frequency |
|----|---------|--------|---------|-----------|
| FP-001 | Inode Exhaustion | Resource Exhaustion | linux, storage | Common |
| FP-002 | Connection Pool Exhaustion | Resource Exhaustion | databases, k8s | Very Common |
| FP-003 | Disk Full (Reserved Blocks Gone) | Resource Exhaustion | linux | Common |
| FP-004 | OOM Without Swap Buffer | Resource Exhaustion | linux, k8s | Common |
| FP-005 | Cgroup Soft/Hard Limit Confusion | Resource Exhaustion | linux, k8s | Common |
| FP-006 | PID Exhaustion via Zombies | Resource Exhaustion | linux, cicd | Uncommon |
| FP-007 | tmpfs Consuming Hidden RAM | Resource Exhaustion | linux | Uncommon |
| FP-008 | RAID Rebuild I/O Saturation | Resource Exhaustion | storage, datacenter | Common |
| FP-009 | Retry Storm | Thundering Herd | distributed, k8s | Very Common |
| FP-010 | Cache Stampede | Thundering Herd | distributed, databases | Common |
| FP-011 | Restart Avalanche | Thundering Herd | k8s | Common |
| FP-012 | Deep Health Check Cascade | Thundering Herd | k8s, distributed | Common |
| FP-013 | Simultaneous Timer Expiry | Thundering Herd | distributed | Uncommon |
| FP-014 | Two-Node Quorum Trap | Split Brain | distributed, k8s | Common |
| FP-015 | Stale Leader | Split Brain | distributed | Common |
| FP-016 | Dual-Write Divergence | Split Brain | databases, distributed | Common |
| FP-017 | Clock Skew Ordering | Split Brain | distributed, datacenter | Common |
| FP-018 | Timeout Assumed = Not Executed | Split Brain | distributed | Very Common |
| FP-019 | No Circuit Breaker | Cascading Failure | distributed, k8s | Very Common |
| FP-020 | Missing Backpressure | Cascading Failure | distributed | Common |
| FP-021 | Retry Amplification | Cascading Failure | distributed | Common |
| FP-022 | Dependency Chain Collapse | Cascading Failure | distributed, k8s | Common |
| FP-023 | Thread Pool Exhaustion | Cascading Failure | distributed | Common |
| FP-024 | Health Check Lying | Cascading Failure | k8s, distributed | Common |
| FP-025 | Untested Backup | Silent Corruption | databases, storage | Very Common |
| FP-026 | Replication Lag at Failover | Silent Corruption | databases | Common |
| FP-027 | Missing Point-in-Time Recovery | Silent Corruption | databases | Common |
| FP-028 | Zombie Process Accumulation | Silent Corruption | linux, cicd | Uncommon |
| FP-029 | Deleted-But-Open File | Silent Corruption | linux | Common |
| FP-030 | Transaction ID Wraparound | Silent Corruption | databases | Uncommon |
| FP-031 | Stale Image Tag | Silent Corruption | k8s, cicd | Common |
| FP-032 | Rollout Hang (Zero Surge + Zero Unavailable) | Configuration Landmine | k8s | Common |
| FP-033 | `latest` Tag in Production | Configuration Landmine | k8s, docker | Very Common |
| FP-034 | Hardcoded Namespace Override | Configuration Landmine | k8s | Common |
| FP-035 | Memory Limit Equals Request | Configuration Landmine | k8s | Common |
| FP-036 | `ndots:5` Query Amplification | Configuration Landmine | k8s, networking | Common |
| FP-037 | StatefulSet OrderedReady Deadlock | Configuration Landmine | k8s | Uncommon |
| FP-038 | PVC Reclaim Policy Delete | Configuration Landmine | k8s, storage | Common |
| FP-039 | STP Disabled + Loop Created | Configuration Landmine | networking, datacenter | Uncommon |
| FP-040 | Metric Cardinality Explosion | Observability Gap | observability | Common |
| FP-041 | Alerting on Restart (not Root Cause) | Observability Gap | k8s, observability | Very Common |
| FP-042 | Missing `absent()` Alert | Observability Gap | observability | Common |
| FP-043 | Percentile Blindness | Observability Gap | observability | Very Common |
| FP-044 | `rate()` Over Too-Short Window | Observability Gap | observability | Common |
| FP-045 | Unstructured Logging | Observability Gap | observability | Very Common |
| FP-046 | Wrong Terminal Tab | Human Error Amplifier | databases, linux | Very Common |
| FP-047 | Apply-Without-Reading Manifest | Human Error Amplifier | k8s | Common |
| FP-048 | Device Name Confusion | Human Error Amplifier | linux, storage | Common |
| FP-049 | Port-Forward as Permanent Fix | Human Error Amplifier | k8s | Common |
| FP-050 | Runbook with No Contacts | Human Error Amplifier | incident-command | Very Common |
| FP-051 | Missing Escalation Criteria | Human Error Amplifier | incident-command | Very Common |
| FP-052 | Untested Rollback Procedure | Human Error Amplifier | cicd, k8s | Common |

Pattern Families

| Family | Count | Core Idea |
|--------|-------|-----------|
| Resource Exhaustion | 8 | A finite resource runs out, and the way the system fails is worse than the shortage itself |
| Thundering Herd | 5 | Many actors attempt the same action simultaneously; the spike overwhelms the target |
| Split Brain | 5 | Two nodes each believe they are authoritative; writes diverge or operations repeat |
| Cascading Failure | 6 | One component's degradation causes adjacent components to fail in a chain |
| Silent Corruption | 7 | Data, state, or behavior degrades without triggering any alert |
| Configuration Landmine | 8 | A default or setting is safe under normal conditions but catastrophic at an edge case |
| Observability Gap | 6 | Monitoring is present but systematically blind to the actual failure |
| Human Error Amplifier | 7 | A process or environment design amplifies small human errors into large incidents |
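
To make the Thundering Herd core idea concrete, here is a minimal sketch of one common mitigation: exponential backoff with full jitter, so that clients retrying after the same failure do not all return at the same instant. The names `backoff_delays` and `call_with_retry`, and the default `base` and `cap` values, are illustrative assumptions and do not come from the catalog.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=10.0, rng=random):
    """Yield one randomized delay (in seconds) per retry, using full jitter."""
    for attempt in range(attempts):
        # The upper bound doubles on each attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Drawing uniformly from [0, ceiling] spreads clients apart in time.
        yield rng.uniform(0, ceiling)


def call_with_retry(operation, attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call `operation` up to `attempts` times, sleeping a jittered delay between tries."""
    delays = list(backoff_delays(attempts - 1, base, cap))
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(delays[attempt])
```

The `sleep` parameter is injectable only so the sketch can be exercised without real waiting. A production version would usually also restrict which exceptions are retried and enforce an overall deadline.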

Reading Order

New to production debugging? Start here:

1. FP-002 Connection Pool Exhaustion: the most common "why is everything slow?" root cause
2. FP-009 Retry Storm: the most common way a recovery makes things worse
3. FP-019 No Circuit Breaker: the structural gap that turns blips into outages
4. FP-025 Untested Backup: the disaster you only discover during the disaster
5. FP-046 Wrong Terminal Tab: the human error you will make; learn to pre-empt it
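
For readers starting with FP-019, the following is a rough, single-threaded sketch of the mechanism whose absence that pattern describes: a circuit breaker that fails fast once a dependency has failed repeatedly, then allows a trial call after a cooldown. The class name, the `failure_threshold` and `reset_timeout` parameters, and their defaults are assumptions chosen for illustration, not taken from the catalog or any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still cooling down: fail fast instead of piling onto the dependency.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cooldown elapsed (half-open): fall through and allow a trial call.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch_profile(user_id))`, where `fetch_profile` is a hypothetical function. This sketch is not thread-safe and does not limit the half-open state to a single concurrent trial call, both of which a real implementation would need to handle.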

Cross-Reference with Case Studies

Many of these patterns appear in ../case-studies/. Key mappings:

| Pattern | Case Study |
|---------|------------|
| FP-001 Inode Exhaustion | linux_ops/inode-exhaustion/ |
| FP-006 PID/Zombie | linux_ops/zombie-processes-accumulating/ |
| FP-014 Two-Node Quorum | ops-archaeology/14-split-brain-etcd/ |
| FP-017 Clock Skew | cross-domain/hpa-flapping-clock-skew-ntp/ |
| FP-018 Timeout = Not Executed | ops-archaeology/04-postgres-replica-lag/ |
| FP-040 Cardinality Explosion | cross-domain/grafana-empty-prometheus-networkpolicy/ |
| FP-043 Percentile Blindness | ops-archaeology/09-monitoring-gap/ |