Mental Model: Failure Domains¶
Category: System Behavior
Origin: Systems engineering and high-availability design practice; institutionalized in cloud architecture through AWS Availability Zones (2006) and Google Borg/Kubernetes design
One-liner: A failure domain is a set of components that fail together — good design ensures correlated failures are contained within domains and independent across them.
The Model¶
A failure domain is any boundary within which components share a common fate. When something in the domain fails, everything in the domain is at risk. Across domain boundaries, failures are independent. The core design principle: maximize the independence of failure domains so that losing one domain does not cascade to others.
The classic physical failure domains, from smallest to largest, are: individual component (NIC, disk, power supply) → server → rack (shared top-of-rack switch and power distribution unit) → power domain (shared UPS and generator) → datacenter row → datacenter → campus → geographic region. A rack switch failure takes down every server in the rack. A PDU failure takes down every rack on that PDU. A datacenter generator failure takes down the entire datacenter. The hierarchy is fixed by physical infrastructure — the design question is how to distribute your workload across it.
In cloud and Kubernetes environments, the failure domains become: pod → node → node pool → availability zone → region. A node failure takes down every pod scheduled on that node. An AZ failure takes down every node in that AZ. Kubernetes mechanisms such as `topologySpreadConstraints` and `podAntiAffinity` exist to distribute workloads across failure domains so that any one domain's failure is survivable.
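As a concrete illustration, a Kubernetes Deployment can declare the AZ as its spread domain. A minimal sketch — the service name, label, and image are placeholder assumptions, not from any real manifest:

```yaml
# Hypothetical Deployment fragment: spread 6 api pods evenly across AZs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                     # hypothetical service name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                                # zones may differ by at most one pod
          topologyKey: topology.kubernetes.io/zone  # the failure domain to spread across
          whenUnsatisfiable: DoNotSchedule          # hard constraint: refuse to concentrate
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: example/api:v2.1.3                 # hypothetical image
```

With three AZs and `maxSkew: 1`, the scheduler places two pods per zone — the cross-AZ distribution shown in the "Good" diagram below. Using `DoNotSchedule` rather than `ScheduleAnyway` trades scheduling flexibility for a guarantee that pods never silently concentrate in one domain.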
Failure domains are not only physical. Software versions are a failure domain: all pods running v2.1.3 of a service share a fate if v2.1.3 has a critical bug. Configuration values are a failure domain: all services reading from the same ConfigMap or secrets store share a fate if that store is corrupted. DNS resolvers are a failure domain: all services depending on a single CoreDNS pod share a fate if that pod crashes. Identifying failure domains requires thinking across physical, network, software, and data planes.
The failure domain model has a symmetry property: within a domain, design for high availability through redundancy. Across domains, design for independence through isolation. These are different techniques. Within a rack, you use bonded links and RAID. Across racks, you use separate routing paths and no shared state. Within an AZ, you run multiple replicas. Across AZs, you use regional load balancing and avoid cross-AZ synchronous dependencies.
Visual¶
Physical Failure Domain Hierarchy:
Geographic Region
└── Datacenter / Cloud Region
└── Availability Zone
└── Power Domain (PDU / generator circuit)
└── Network Domain (top-of-rack switch)
└── Server / Node
└── Process / Pod
A failure at any level propagates to everything below it,
but is contained at that level (does not cross to sibling domains).
Cross-AZ Distribution (Good):
Region: us-east-1
┌──────────────────────────────────────────────────────┐
│ AZ: us-east-1a AZ: us-east-1b AZ: us-east-1c │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ api-pod ×2 │ │ api-pod ×2 │ │ api-pod ×2 │ │
│ │ db replica │ │ db primary │ │ db replica │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└──────────────────────────────────────────────────────┘
Losing us-east-1b: 2 of 6 api pods lost (survivable),
failover to replica (automated).
Single-AZ Concentration (Dangerous):
┌──────────────────────────────────────────────────────┐
│ AZ: us-east-1a (all pods here) AZ: us-east-1b (0) │
│ ┌─────────────────────────────┐ ┌───────────────┐ │
│ │ api-pod ×6 │ │ (empty) │ │
│ │ db primary + replica │ │ │ │
│ └─────────────────────────────┘ └───────────────┘ │
└──────────────────────────────────────────────────────┘
Losing us-east-1a: 100% outage. Replicas share the fault domain.
Software Version as Failure Domain:
Before staged rollout (all same version = one domain):
[v2.1.3][v2.1.3][v2.1.3][v2.1.3][v2.1.3][v2.1.3]
Bug in v2.1.3 → 100% blast radius
During staged rollout (version diversity = separate domains):
[v2.1.4][v2.1.4][v2.1.3][v2.1.3][v2.1.3][v2.1.3]
Bug in v2.1.4 → 33% blast radius, detected before full rollout
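The staged-rollout picture above maps naturally onto a canary Deployment running alongside the stable one. A sketch, assuming a hypothetical `api` service whose stable Deployment keeps four replicas on v2.1.3:

```yaml
# Hypothetical canary Deployment: runs the new version alongside the
# stable pods, creating a separate version failure domain before full
# rollout. Both Deployments share the Service selector (app: api).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-canary              # hypothetical canary name
spec:
  replicas: 2                   # 2 of 6 total pods → ~33% blast radius on a bad release
  selector:
    matchLabels:
      app: api
      track: canary
  template:
    metadata:
      labels:
        app: api                # receives a share of Service traffic
        track: canary           # distinguishes the canary domain for monitoring
    spec:
      containers:
        - name: api
          image: example/api:v2.1.4   # new version, isolated to the canary domain
```

If v2.1.4 misbehaves, deleting the canary Deployment restores a single healthy version domain; if it holds, the stable Deployment is updated and the canary retired.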
```mermaid
flowchart TD
    R["Geographic Region"] --> DC["Datacenter / Cloud Region"]
    DC --> AZ1["Availability Zone A"]
    DC --> AZ2["Availability Zone B"]
    DC --> AZ3["Availability Zone C"]
    AZ1 --> N1["Node 1"]
    AZ1 --> N2["Node 2"]
    AZ2 --> N3["Node 3"]
    AZ2 --> N4["Node 4"]
    AZ3 --> N5["Node 5"]
    AZ3 --> N6["Node 6"]
    N1 --> P1["api-pod"]
    N2 --> P2["api-pod"]
    N3 --> P3["api-pod"]
    N4 --> P4["db-primary"]
    N5 --> P5["api-pod"]
    N6 --> P6["db-replica"]
    style AZ2 fill:#f55,color:#fff
    style N3 fill:#f55,color:#fff
    style N4 fill:#f55,color:#fff
    style P3 fill:#f55,color:#fff
    style P4 fill:#f55,color:#fff
```
AZ B failure (red): 2 of 6 pods lost, db failover to replica — survivable because workloads span domains.
When to Reach for This¶
- When designing pod scheduling for a new service: use `topologySpreadConstraints` to distribute pods across AZs; name the failure domains explicitly in the design document
- When investigating a network-related incident: map the failure to its physical domain (bad optic → port → switch → rack) to understand which other components share the fault
- When a "single replica" service exists: any single replica is its own failure domain — its fault domain boundary is that pod's node; a node failure equals a service outage
- When planning hardware maintenance: know which failure domain the maintenance affects and verify that workloads are distributed so that domain's loss is survivable
- When evaluating a new dependency: ask "what failure domain does this dependency live in, and is it the same failure domain as my service?" (shared fate = single point of failure)
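For the "do the primary and replica share a failure domain?" question, `podAntiAffinity` can make co-location impossible rather than merely unlikely. A sketch — the StatefulSet name and image are assumptions:

```yaml
# Hypothetical StatefulSet fragment: forbid two db pods in the same AZ,
# so primary and replica can never share that failure domain.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db                      # hypothetical database name
spec:
  replicas: 2
  serviceName: db
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: topology.kubernetes.io/zone  # the domain being separated
              labelSelector:
                matchLabels:
                  app: db
      containers:
        - name: db
          image: example/db:1.0       # hypothetical image
```

The `required...` (hard) form refuses to schedule a second pod into an occupied zone; the `preferred...` form would fall back to co-location under pressure, which silently recreates the single-AZ concentration pattern.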
When NOT to Use This¶
- As a guarantee rather than a probability: failure domain boundaries reduce failure correlation but do not eliminate it; shared software stacks, shared DNS, and shared control planes create hidden cross-domain coupling
- When the cost of cross-domain distribution outweighs the risk: for non-critical internal tooling, running in a single AZ may be acceptable — failure domain analysis should inform a decision, not mandate a specific architecture
- For performance analysis: failure domains explain what fails together, not how fast things run; use Queueing Theory and Amdahl's Law for performance modeling
Applied Examples¶
Example 1: Link Flap from a Bad Optical Transceiver¶
A server in a datacenter begins exhibiting intermittent connectivity loss. Monitoring shows packet loss and link state transitions (link flaps) on the server's uplink port.
Failure domain analysis:
- Component domain: the bad optical transceiver (SFP) sits either in the server's NIC or in the top-of-rack switch port. If the fault is on the server side, replacing the server's SFP resolves it; if it is on the switch side, replace the fiber or the switch-port SFP, or move to a different port.
- Network domain: if the flapping link is one of a bonded pair, the bond failover mechanism (LACP) should catch it. The failure domain for bond failover is the pair of physical links — if one flaps, traffic moves to the other.
- Rack domain: if the bad optic is in the top-of-rack switch itself and causes the switch to crash or wedge, all servers in the rack lose connectivity simultaneously. This is the rack's failure domain.
Diagnosing which failure domain is affected determines the scope of impact and the fix. A single bad SFP is a component-level failure; a crashing switch is a rack-level failure. The same symptom (link flaps on one server) can have wildly different blast radii depending on which layer the fault is in.
Example 2: Network Bond Failover Not Working¶
Two servers are connected via a bonded pair of links (LACP) for redundancy. The design intent is that the two links are in different failure domains — different physical paths, different switch ports.
Investigation reveals that both bond links are connected to the same top-of-rack switch. The "redundant" links share the same network failure domain — the rack switch. When the switch fails, both bonds fail simultaneously, and the failover provides no protection.
True redundancy requires:
- Link 1 → Switch A (ToR switch in row 1)
- Link 2 → Switch B (ToR switch in row 2, different power domain)
With this layout, the two links are in different failure domains. Switch A failing leaves Link 2 up; Switch B failing leaves Link 1 up. The failure domains are genuinely independent.
This is one of the most common misconfigurations in datacenter networking: redundancy that exists on paper but does not cross failure domain boundaries in practice.
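A bond whose links are meant to cross failure domains can be sketched in netplan (interface names and the address are assumptions). The crucial property lives in the comments, because the config itself cannot verify the cabling:

```yaml
# Hypothetical netplan config: an LACP (802.3ad) bond over two NICs.
# YAML cannot express "different switches" — that property lives in the
# physical cabling: eno1 must be patched to Switch A and eno2 to
# Switch B, or the bond's two links collapse into one failure domain.
network:
  version: 2
  ethernets:
    eno1: {}     # cable to ToR Switch A (row 1)
    eno2: {}     # cable to ToR Switch B (row 2, different power domain)
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: 802.3ad              # LACP
        lacp-rate: fast
        mii-monitor-interval: 100  # ms between link-state checks
      addresses: [10.0.0.10/24]    # hypothetical address
```

One caveat: terminating a single LACP bond on two physically separate switches requires those switches to present themselves as one logical peer (MLAG/vPC or stacking); without that, `active-backup` mode is the mode that actually crosses the switch failure domain.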
The Junior vs Senior Gap¶
| Junior | Senior |
|---|---|
| Deploys 3 replicas to achieve high availability | Deploys 3 replicas and verifies they are scheduled across 3 different AZs (or nodes in different failure domains) |
| Treats "we have a replica" as sufficient redundancy | Asks "do the primary and replica share a failure domain?" — co-located replicas offer no protection against the domain's fault |
| Sees network redundancy as the presence of two cables | Traces both cables to verify they enter different switches, circuits, and racks |
| Unaware that software version is a failure domain | Uses staged rollouts and version pinning to create version diversity, separating the failure domain of old and new code |
Connections¶
- Complements: Blast Radius — failure domains define the structural boundaries that contain blast radius; blast radius is the quantitative outcome when a failure domain's boundary is crossed
- Complements: Swiss Cheese Model — each failure domain boundary acts as a protective layer; coupling across failure domains is equivalent to two Swiss cheese slices sharing holes in the same location
- Tensions: CAP Theorem — distributing across failure domains (AZs, regions) increases the probability of network partitions between those domains; the CAP tradeoff becomes more consequential as you spread further across failure domain boundaries
- Topic Packs: networking, kubernetes
- Case Studies: link-flaps-bad-optic (bad optic is a component failure domain; the impact depends on whether failover crosses the failure domain boundary), bonding-failover-not-working (bond links sharing a switch share a failure domain, defeating the redundancy design)