Anti-Primer: Kubernetes Storage

Everything that can go wrong, will — and in this story, it does.

The Setup

A team is deploying a StatefulSet for a database that requires persistent storage. They are new to Kubernetes storage and chose a storage class based on a blog post. Data migration begins tomorrow.

The Timeline

Hour 0: ReclaimPolicy Delete

The team uses a StorageClass with reclaimPolicy: Delete for the database PVs. The deadline was looming, and this seemed like the fastest path forward. Deleting a StatefulSet on its own leaves its PVCs in place by default, but a clean redeployment usually removes the claims too. The result: once the StatefulSet and its PVCs are deleted for a redeployment, the backing persistent volumes and all their data are permanently deleted.

Footgun #1: ReclaimPolicy Delete. Using a StorageClass with reclaimPolicy: Delete for the database PVs means that deleting the StatefulSet and its PVCs for a redeployment permanently deletes all persistent volumes and data.
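For contrast, here is a minimal sketch of a StorageClass that keeps the backing volume when its claim is deleted. The class name db-retain and the provisioner ebs.csi.aws.com are assumptions for illustration (the story only says "cloud block storage"); substitute whatever CSI driver the cluster actually runs.

```yaml
# Sketch only: the name and provisioner below are hypothetical placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-retain                # hypothetical class name
provisioner: ebs.csi.aws.com     # assumption: replace with your cluster's CSI driver
reclaimPolicy: Retain            # keep the PV and its backing disk when the PVC is deleted
```

With Retain, deleting a claim leaves its PV in the Released state rather than destroying it. Someone has to reclaim or delete the volume by hand, so retained disks keep costing money until they are cleaned up.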

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Wrong Access Mode

The team requests ReadWriteMany on cloud block storage that only supports ReadWriteOnce. Under time pressure, they chose speed over caution. The result: the PVC stays Pending indefinitely, because no PV can be provisioned with the requested access mode.

Footgun #2: Wrong Access Mode. Requesting ReadWriteMany on cloud block storage that only supports ReadWriteOnce leaves the PVC Pending indefinitely, because no PV can be provisioned with the requested access mode.
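A claim that matches what block storage can deliver might look like the sketch below. The claim name, class name, and size are hypothetical. In a StatefulSet the same spec would sit under volumeClaimTemplates rather than in a standalone PVC.

```yaml
# Sketch only: names and size are illustrative, not taken from the story.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                  # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce              # mountable read-write by a single node
  storageClassName: db-retain    # hypothetical class backed by block storage
  resources:
    requests:
      storage: 100Gi             # illustrative size
```

If several nodes really do need to write to the same volume, the usual answer is a file-based storage backend that advertises ReadWriteMany, not a different flag on a block device.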

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Volume Resize Not Enabled

The team needs to expand a PVC, but the StorageClass does not set allowVolumeExpansion: true. Nobody pushed back because the shortcut looked harmless in the moment. The result: the volume cannot be resized online, and growing it requires migrating the data to a new, larger PVC.

Footgun #3: Volume Resize Not Enabled. Needing to expand a PVC whose StorageClass does not set allowVolumeExpansion: true means the volume cannot be resized online, and growing it requires migrating the data to a new, larger PVC.
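With expansion enabled from the start, growing the volume is a one-line change to the claim. The sketch below assumes the CSI driver itself supports expansion. The names, provisioner, and sizes are hypothetical.

```yaml
# Sketch only: names, provisioner, and sizes are illustrative placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-expandable            # hypothetical class name
provisioner: ebs.csi.aws.com     # assumption: replace with your cluster's CSI driver
allowVolumeExpansion: true       # permits raising the storage request on bound PVCs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data                  # hypothetical claim name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: db-expandable
  resources:
    requests:
      storage: 200Gi             # raised from an illustrative 100Gi; requests can only grow
```

Expansion is one-way: Kubernetes rejects an attempt to lower the storage request, so shrinking a volume still means migrating to a new PVC.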

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: No Backup Before Migration

The team starts the database migration without snapshotting the existing PVs. They had gotten away with similar shortcuts before, so nobody raised a flag. The result: the migration script corrupts data, there is no snapshot to restore from, and a 48-hour recovery effort follows.

Footgun #4: No Backup Before Migration. Starting the database migration without snapshotting the existing PVs means that when the migration script corrupts data, there is no snapshot to restore from, and a 48-hour recovery effort follows.
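A pre-migration snapshot can be as small as the sketch below. It assumes the cluster has the VolumeSnapshot CRDs and snapshot controller installed and a CSI driver that supports snapshots. The snapshot name, snapshot class, and claim name are hypothetical.

```yaml
# Sketch only: requires the snapshot CRDs, a snapshot controller, and a
# snapshot-capable CSI driver. All names are hypothetical placeholders.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-pre-migration              # hypothetical snapshot name
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical; must already exist in the cluster
  source:
    persistentVolumeClaimName: db-data     # hypothetical PVC to snapshot
```

Before the migration starts, it is worth confirming that the snapshot's status.readyToUse field is true. A volume snapshot taken under a running database is typically only crash-consistent, so pausing writes first or pairing it with a logical dump is the more cautious route.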

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

#	Mistake	Consequence	Could Have Been Prevented By
1	ReclaimPolicy Delete	Deleting the StatefulSet and its PVCs for a redeployment permanently deletes all persistent volumes and data	Primer: Use reclaimPolicy: Retain for any storage class backing stateful data
2	Wrong Access Mode	PVC stays Pending forever; the PV cannot be provisioned with the requested access mode	Primer: Check storage class capabilities before requesting access modes
3	Volume Resize Not Enabled	Cannot resize the volume online; requires data migration to a new, larger PVC	Primer: Enable allowVolumeExpansion on StorageClasses from the start
4	No Backup Before Migration	Migration script corrupts data; no snapshot to restore from; 48-hour recovery effort	Primer: Always snapshot PVs before any data migration or schema change

Damage Report

Downtime: 2-4 hours of pod-level or cluster-wide disruption
Data loss: Risk of volume data loss if StatefulSets were affected
Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification

What the Primer Teaches

Footgun #1: If the engineer had read the primer's section on ReclaimPolicy Delete, they would have learned to use reclaimPolicy: Retain for any storage class backing stateful data.
Footgun #2: If the engineer had read the primer's section on Wrong Access Mode, they would have learned to check storage class capabilities before requesting access modes.
Footgun #3: If the engineer had read the primer's section on Volume Resize Not Enabled, they would have learned to enable allowVolumeExpansion on StorageClasses from the start.
Footgun #4: If the engineer had read the primer's section on No Backup Before Migration, they would have learned to always snapshot PVs before any data migration or schema change.

Cross-References

Primer — The right way
Footguns — The mistakes catalogued
Street Ops — How to do it in practice