
Mental Model: Immutable Infrastructure

Category: Operational Reasoning
Origin: Chad Fowler's 2013 blog post "Trash Your Servers and Burn Your Code"; the principle has roots in functional programming's immutability concept; fully operationalized through containers (Docker, 2013) and infrastructure-as-code tools (Terraform, Packer)
One-liner: Never modify infrastructure in place — replace it, so every environment is built from a known specification and drift is structurally impossible.

The Model

Immutable infrastructure is the principle that once a server, container, or piece of infrastructure is deployed, it is never changed. No patches applied in place. No configuration files edited on running systems. No "quick fix" made directly on production without being captured in the source template. When a change is needed — a new application version, a security patch, a configuration update — a new artifact is built from source, tested in isolation, and deployed as a replacement for the existing one. The old instance is discarded.
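Concretely, this means the artifact is fully determined by source. As a minimal sketch (image versions, paths, and registry names here are hypothetical), even a one-line config tweak means rebuilding and retagging rather than editing the running server:

```dockerfile
# Sketch: the artifact is fully defined by source (names and versions hypothetical)
FROM nginx:1.22

# Configuration changes happen here, in source control -- never via SSH
COPY nginx.conf /etc/nginx/nginx.conf
COPY dist/ /usr/share/nginx/html/

# Each change produces a new versioned artifact:
#   docker build -t registry.example.com/web:v1.1 .
#   docker push registry.example.com/web:v1.1
```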

The model solves a specific, pervasive problem: configuration drift. In a mutable infrastructure model, servers accumulate small, undocumented changes over time. A log rotation cron job added manually. A kernel parameter tuned during an incident and never captured in Ansible. A library installed to debug a problem three months ago, still present, unknown to anyone. After two years, the production server is a palimpsest of accumulated interventions — it does not match the configuration management database, the Ansible playbooks, or the Terraform state. When something breaks, you cannot know which of a hundred undocumented changes is responsible. Reproducing the environment from scratch is difficult or impossible. "Works on production, fails everywhere else" is the symptom.

Immutable infrastructure makes drift structurally impossible because there is no mechanism for drift. You cannot modify a running Docker container's filesystem and have that modification persist — the container's writeable layer is ephemeral. You cannot patch a Kubernetes pod in place — you update the image, and the Deployment controller replaces the pods. Infrastructure built from a Packer AMI is always identical to the AMI definition at build time. Every change leaves a trail in source control, has a build artifact, and goes through the deployment pipeline. The production environment is always — by definition — identical to what was last deployed from the validated artifact.
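The ephemerality of the container's writeable layer can be demonstrated directly (a sketch against a local Docker daemon; the image and container names are arbitrary):

```shell
docker run -d --name web nginx:1.22
docker exec web sh -c 'echo patched > /usr/share/nginx/html/index.html'

# Replace the container instead of keeping the manual edit
docker rm -f web
docker run -d --name web nginx:1.22
# The fresh container serves the original index.html -- the in-place edit did not survive
```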

Rollback becomes trivially safe. In a mutable infrastructure world, rolling back means un-applying a series of changes, some of which may have had side effects. Did the database migration get applied? Was the log format changed? Which services were restarted? Rollback in this model is archaeology. In an immutable model, rollback means redeploying the previous artifact version — which is a tested, known-good state. The entire deployment history is artifact-addressable: any version can be redeployed at any time with full fidelity. This property transforms incident response; "rollback the deployment" becomes a two-minute operation, not a thirty-minute investigation.
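On Kubernetes, for example, that two-minute rollback is one command (the deployment, container, and registry names below are hypothetical):

```shell
# Roll back to the previously recorded revision of the Deployment
kubectl rollout undo deployment/payment-api

# Or pin an explicit known-good artifact version
kubectl set image deployment/payment-api payment-api=registry.example.com/payment-api:v1.1

# Watch the replacement pods roll out
kubectl rollout status deployment/payment-api
```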

The model does have a boundary condition: state. Infrastructure that manages stateful workloads — databases, message queues, file systems — cannot be made fully immutable in the same way as stateless application servers. The application tier is an excellent candidate for immutability; the data tier requires complementary strategies (database migrations as code, point-in-time recovery, backup and restore validation). The practical immutable architecture keeps the application tier immutable and stateless, with state pushed to managed services or dedicated stateful layers with their own durability guarantees.

Visual

MUTABLE INFRASTRUCTURE (the problem)
────────────────────────────────────────────────────────────
  Day 1 deployment:
    Server: Ubuntu 22.04, nginx 1.22, app v1.0
    Config: as specified in Ansible

  Day 30 (after various incidents and changes):
    Server: Ubuntu 22.04, nginx 1.22 + manual patch,
            app v1.4 + hotfix applied via ssh,
            new cron job added by @alice during incident,
            kernel param net.core.somaxconn changed by @bob,
            extra Python library installed for debugging
    Config: DIVERGED from Ansible by 7 unknown changes

   Drift is invisible until something breaks.
   Reproducing the environment is impossible.
   Rollback is archaeology.

IMMUTABLE INFRASTRUCTURE (the solution)
────────────────────────────────────────────────────────────
  Source Code ──► Build ──► Artifact ──► Deploy ──► Discard

  v1.0 ──► [Dockerfile] ──► image:v1.0 ──► pods running ──► deprecated
  v1.1 ──► [Dockerfile] ──► image:v1.1 ──► pods running ──► deprecated
  v1.2 ──► [Dockerfile] ──► image:v1.2 ──► pods running (current)

  Rollback = redeploy image:v1.1
  No drift possible: every pod is built fresh from image:v1.2

DEPLOYMENT PATTERN: Blue-Green with Immutable Artifacts
────────────────────────────────────────────────────────────
  Load Balancer
      ├──► [BLUE environment:  v1.1] ◄── live traffic
      └──► [GREEN environment: v1.2] ◄── built from new artifact,
                                          smoke tested, ready

  Switch traffic: zero downtime, instantaneous
  Rollback: switch back to BLUE (still running)
  BLUE is discarded only after GREEN is stable.
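In Kubernetes terms, the traffic switch above can be a one-line change to a Service selector (a sketch; the labels, ports, and names are assumptions):

```yaml
# Blue-green switch by flipping the Service selector (names hypothetical)
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  selector:
    app: payment-api
    track: green     # was "blue"; changing this one line moves live traffic
  ports:
    - port: 80
      targetPort: 8080
```

Because both environments keep running, "switch back to BLUE" is the same one-line edit in reverse.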

TERRAFORM STATE MODEL (infrastructure immutability)
────────────────────────────────────────────────────────────
  terraform.tfstate = source of truth for deployed state

  Change = edit .tf file ──► terraform plan ──► review ──► terraform apply
           (never SSH to server and manually change anything)

  If drift detected: terraform plan shows it; terraform apply corrects it
  Drift cannot hide because apply always reconciles to .tf definition
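A minimal Terraform sketch of this discipline (the resource name and AMI ID are placeholders):

```hcl
resource "aws_instance" "web" {
  # The AMI is an immutable artifact baked by Packer; shipping a change
  # means referencing a new AMI, which forces instance replacement
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.medium"

  lifecycle {
    # Bring the replacement up before destroying the old instance
    create_before_destroy = true
  }
}
```

Because changing `ami` forces replacement, terraform plan reports the instance as "must be replaced" rather than updated in place — the plan output itself enforces the replace-don't-modify model.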

When to Reach for This

  • When designing a new deployment architecture: start with immutability as the default and carve out explicit exceptions for stateful layers
  • When an incident was caused by configuration drift or an undocumented manual change — the postmortem action item is to move toward immutable infrastructure
  • When rollback from a bad deployment took more than 5 minutes — immutable infrastructure should reduce this to under 2 minutes
  • When debugging "works on my machine" or "works in staging, fails in production" — environment parity through immutable artifacts eliminates the class of bugs caused by diverged environments
  • When a long-running server has unknown accumulated state — instead of auditing it, rebuild the service from the current definition and discard the old server
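Rebuilding "from the current definition" presupposes that the definition exists as code. A Packer HCL sketch of such a server image definition (the source AMI, region, and provisioning steps are all assumptions):

```hcl
locals {
  timestamp = regex_replace(timestamp(), "[- TZ:]", "")
}

source "amazon-ebs" "web" {
  ami_name      = "web-${local.timestamp}"   # every build is a new, versioned image
  instance_type = "t3.micro"
  region        = "us-east-1"
  source_ami    = "ami-0123456789abcdef0"    # placeholder base image
  ssh_username  = "ubuntu"
}

build {
  sources = ["source.amazon-ebs.web"]

  provisioner "shell" {
    inline = ["sudo apt-get update", "sudo apt-get install -y nginx"]
  }
}
```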

When NOT to Use This

  • Do not apply immutability rigidly to stateful systems (primary databases, object storage, message queues) without complementary durability and migration strategies — replacing a running database server with a fresh one is a data loss event, not a deployment
  • Immutable infrastructure has a build-time cost: every change requires a full artifact build and deployment cycle. For extremely rapid iteration in development environments, this can be slower than in-place editing; the model is primarily a production concern
  • Do not confuse "immutable" with "unchangeable" — immutable infrastructure changes frequently; it just changes by replacement, not modification. The discipline is in the process, not in reducing change frequency
  • Avoid half-measures: an immutable application tier running on mutable servers (where someone can still SSH and change things) retains most of the problems the model is meant to solve

Applied Examples

Example 1: Firmware update causes boot loop — the immutable hardware case

A datacenter applies a firmware update to a batch of servers that causes a boot loop. In a mutable model, recovery means figuring out which firmware change caused the problem, finding a downgrade path, and applying it to each affected server — potentially with manual BIOS intervention.

In an immutable model, the firmware version is codified in the server provisioning specification (e.g., a PXE boot configuration or IPMI-driven provisioning script). The incident triggers a rollback of the provisioning specification to the prior firmware version, and the servers are re-provisioned. No manual per-server intervention. The broken firmware version is removed from the allowed set before the specification can be applied to additional servers.

The deeper lesson: the firmware update was applied without going through the immutable provisioning pipeline — it was applied directly. The contributing factor was the existence of a manual change path. The action item is to remove the manual path: all firmware changes go through the provisioning pipeline, with a staging gate.

Example 2: Zero-downtime application upgrade with Kubernetes

A team is deploying a new version of a payment processing API. The old version has been running in production for 6 weeks. The new version has a memory-efficiency improvement that also changes the log format.

Immutable approach: a new Docker image is built (payment-api:v2.3.1), pushed to the registry, and the Kubernetes Deployment manifest is updated with the new image tag. Kubernetes performs a rolling update — new pods are started with the new image, old pods are terminated only after the new pods pass health checks. If a new pod fails to start or fails health checks, the rollout is automatically halted; the old pods continue serving traffic. If the change looks bad after release (new memory behavior causes OOM), kubectl rollout undo deployment/payment-api redeploys payment-api:v2.3.0 in under 90 seconds.
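A sketch of the Deployment fields that produce this behavior (replica count, probe path, and registry are assumptions, not from the incident itself):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # old pods keep serving until new pods are Ready
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
    spec:
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:v2.3.1   # the only line that changes per release
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
```

With maxUnavailable: 0, a new pod that never passes its readiness probe simply stalls the rollout; the v2.3.0 pods are untouched until their replacements are healthy.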

No manual changes were made to any running container. The change history is entirely in the Git commit that updated the image tag in the Deployment manifest.

The Junior vs Senior Gap

  Junior: SSHes into production servers to "quickly fix" a configuration issue
  Senior: Refuses to modify running infrastructure; makes the change in the source template and redeploys

  Junior: Has a set of "known good" servers that no one wants to touch or rebuild
  Senior: Treats every server as disposable; rebuilds from scratch without anxiety because the specification is in code

  Junior: Runs different application versions in different environments due to accumulated drift
  Senior: All environments are built from the same artifact; version differences are explicit, intentional, and tracked

  Junior: Rollback plan is "undo the changes we made" — which requires remembering what was changed
  Senior: Rollback plan is "redeploy the previous artifact" — executable in under 2 minutes with one command

  Junior: Accepts "I'm not sure what's on that server" as a normal state
  Senior: Treats any server whose state is unknown as a reliability liability to be rebuilt

  Junior: Applies security patches manually to each server in sequence
  Senior: Security patches are a base image update; all services using that image are rebuilt and redeployed from the new base

Connections

  • Complements: Cattle vs Pets (immutable infrastructure operationalizes the cattle model — if a server is built from a specification and is always replaceable, it is cattle by definition; pets require mutability)
  • Complements: Shift Left (immutable infrastructure's build pipeline is the mechanism that enforces shift-left — security scans, configuration validation, and compliance checks happen at image build time, not at deployment time)
  • Tensions: Speed of change (immutable infrastructure requires a build and deploy cycle for every change; teams under extreme time pressure during incidents may feel the pull to make manual changes; this tension must be resolved by making the pipeline fast enough that the correct path is also the fast path)
  • Topic Packs: terraform, docker, kubernetes
  • Case Studies: firmware-update-boot-loop (this model explains why the manual firmware update path was the contributing factor and why the fix required eliminating the manual path in favor of immutable provisioning)