
Anti-Primer: Kubernetes Storage

Everything that can go wrong, will — and in this story, it does.

The Setup

A team is deploying a StatefulSet for a database that requires persistent storage. They are new to Kubernetes storage and chose a storage class based on a blog post. Data migration begins tomorrow.

The Timeline

Hour 0: ReclaimPolicy Delete

The team uses a StorageClass with reclaimPolicy: Delete for the database PVs. The deadline was looming, and this seemed like the fastest path forward. With this policy, removing a PVC also removes the bound PV and its backing storage. The consequence: tearing down the StatefulSet and its PVCs for a redeployment permanently deletes all persistent volumes and data.

Footgun #1: ReclaimPolicy Delete. A StorageClass with reclaimPolicy: Delete backs the database PVs, so deleting the StatefulSet and its PVCs for a redeployment permanently deletes all persistent volumes and data.
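
For contrast, here is a minimal sketch of a StorageClass that keeps volumes when their claims are deleted. The class name and the provisioner are assumptions for illustration (the story does not name a cloud or CSI driver), so substitute whatever driver the cluster actually runs.

```yaml
# Sketch only: the name and provisioner are hypothetical, not from the story.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-retain                  # hypothetical class name
provisioner: ebs.csi.aws.com       # assumption: AWS EBS CSI driver
reclaimPolicy: Retain              # keep the PV and its data after the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
```

A StorageClass's reclaimPolicy applies only to volumes provisioned from it afterwards. PVs that already exist keep their own spec.persistentVolumeReclaimPolicy, which would need to be changed on each PV separately.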

Nobody notices yet. The engineer moves on to the next task.

Hour 1: Wrong Access Mode

The team requests ReadWriteMany on cloud block storage that supports only ReadWriteOnce. Under time pressure, they chose speed over caution. The consequence: the PVC stays Pending indefinitely, because no PV can be provisioned with the requested access mode.

Footgun #2: Wrong Access Mode. The PVC requests ReadWriteMany on cloud block storage that supports only ReadWriteOnce, so it stays Pending indefinitely and no PV is ever provisioned.
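
A StatefulSet normally gives each replica its own claim, so ReadWriteOnce is usually sufficient for a database. The sketch below shows that shape. The StatefulSet name, image, mount path, storage class name, and size are all hypothetical placeholders, not values from the story.

```yaml
# Sketch only: names, image, and sizes are hypothetical placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: example.com/db:1.0        # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db       # hypothetical data directory
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]     # one node mounts each volume read-write
        storageClassName: db-retain        # hypothetical class backed by block storage
        resources:
          requests:
            storage: 100Gi
```

Each replica gets its own PVC (data-db-0, data-db-1, and so on), so no volume has to be shared. If several pods genuinely need to write to the same volume, that calls for a file-based storage backend that supports ReadWriteMany rather than a block device.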

The first mistake is still invisible, making the next shortcut feel justified.

Hour 2: Volume Resize Not Enabled

The team needs to expand a PVC, but the StorageClass does not set allowVolumeExpansion: true. Nobody pushed back because the shortcut looked harmless in the moment. The consequence: the volume cannot be resized online, and growing it requires migrating the data to a new, larger PVC.

Footgun #3: Volume Resize Not Enabled. The StorageClass does not set allowVolumeExpansion: true, so the PVC cannot be resized online and growing it requires a data migration to a new, larger PVC.
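
As a rough illustration, expansion is a single field on the StorageClass. The class name and provisioner below are assumptions, and the underlying CSI driver also has to support resizing for the field to have any effect.

```yaml
# Sketch only: the name and provisioner are hypothetical, not from the story.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-expandable              # hypothetical class name
provisioner: ebs.csi.aws.com       # assumption: a CSI driver that supports expansion
reclaimPolicy: Retain
allowVolumeExpansion: true         # lets PVCs from this class request a larger size later
```

With this in place, a PVC is grown by raising spec.resources.requests.storage on the claim. Volumes can only be expanded this way, not shrunk.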

Pressure is mounting. The team is behind schedule and cutting more corners.

Hour 3: No Backup Before Migration

The team starts the database migration without snapshotting the existing PVs. They had gotten away with similar shortcuts before, so nobody raised a flag. The consequence: the migration script corrupts data, there is no snapshot to restore from, and recovery takes 48 hours.

Footgun #4: No Backup Before Migration. The migration starts without a snapshot of the existing PVs, so when the migration script corrupts data there is nothing to restore from, and recovery takes 48 hours.
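
One way to take that snapshot, assuming the cluster has the CSI snapshot CRDs and controller installed and a driver that supports snapshots, is a VolumeSnapshot object per PVC. The snapshot name, snapshot class, and PVC name below are hypothetical.

```yaml
# Sketch only: assumes CSI snapshot support; all names are hypothetical.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-db-0-pre-migration
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical VolumeSnapshotClass
  source:
    persistentVolumeClaimName: data-db-0   # hypothetical PVC of the first replica
```

The migration should wait until the snapshot reports status.readyToUse: true. A volume snapshot captures the disk at a point in time, so quiescing or stopping the database first is the safer choice when application-level consistency matters.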

By hour 3, the compounding failures have reached critical mass. Pages fire. The war room fills up. The team scrambles to understand what went wrong while the system burns.

The Postmortem

Root Cause Chain

| # | Mistake | Consequence | Could Have Been Prevented By |
|---|---------|-------------|------------------------------|
| 1 | ReclaimPolicy Delete | Deleting the StatefulSet and its PVCs for a redeployment permanently deletes all persistent volumes and data | Primer: Use reclaimPolicy: Retain for any storage class backing stateful data |
| 2 | Wrong Access Mode | PVC stays Pending indefinitely; the PV cannot be provisioned with the requested access mode | Primer: Check storage class capabilities before requesting access modes |
| 3 | Volume Resize Not Enabled | Cannot resize the volume online; requires data migration to a new, larger PVC | Primer: Enable allowVolumeExpansion on StorageClasses from the start |
| 4 | No Backup Before Migration | Migration script corrupts data; no snapshot to restore from; 48-hour recovery effort | Primer: Always snapshot PVs before any data migration or schema change |

Damage Report

  • Downtime: 2-4 hours of pod-level or cluster-wide disruption
  • Data loss: Risk of volume data loss if StatefulSets were affected
  • Customer impact: Intermittent 5xx errors, dropped connections, or full service outage
  • Engineering time to remediate: 10-20 engineer-hours for incident response, rollback, and postmortem
  • Reputation cost: On-call fatigue; delayed feature work; possible SLA breach notification

What the Primer Teaches

  • Footgun #1: If the engineer had read the primer's section on ReclaimPolicy Delete, they would have learned to use reclaimPolicy: Retain for any storage class backing stateful data.
  • Footgun #2: If the engineer had read the primer's section on Wrong Access Mode, they would have learned to check storage class capabilities before requesting access modes.
  • Footgun #3: If the engineer had read the primer's section on Volume Resize Not Enabled, they would have learned to enable allowVolumeExpansion on StorageClasses from the start.
  • Footgun #4: If the engineer had read the primer's section on No Backup Before Migration, they would have learned to always snapshot PVs before any data migration or schema change.

Cross-References