Kubernetes Storage Footguns¶
Mistakes that cause data loss, stuck deployments, or storage that silently fails in production.
1. Using Delete reclaim policy for database PVs¶
You use the default StorageClass which has reclaimPolicy: Delete. A teammate deletes the PVC to "clean up." The underlying cloud disk is destroyed. The database data is gone.
What happens: Permanent data loss when the PVC is deleted.
Why: Delete policy destroys the backing storage (EBS volume, GCE PD) when the PVC is removed.
How to avoid: Use reclaimPolicy: Retain for any StorageClass used by databases or stateful workloads. Create separate StorageClasses for ephemeral vs persistent data.
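A minimal sketch of such a StorageClass (the provisioner and parameters here assume the AWS EBS CSI driver and are illustrative — substitute your cloud's CSI driver):

```yaml
# Illustrative StorageClass for database volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: db-retain
provisioner: ebs.csi.aws.com    # assumption: AWS EBS CSI driver
reclaimPolicy: Retain           # keep the cloud disk even if the PVC is deleted
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
```

Note that with Retain, a deleted PVC leaves its PV in the Released phase; the disk survives, but an admin must clear the PV's spec.claimRef (or create a new PV pointing at the disk) before the data can be rebound to a new claim.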
War story: GitLab experienced data loss in 2017 when a database directory was deleted during an incident response. While the root cause was human error, the investigation revealed that their PostgreSQL data PVC used the default StorageClass with the Delete reclaim policy. A separate incident where PVCs were cleaned up could have compounded the loss. The incident led to widespread adoption of Retain policies for database storage across the industry.
2. Immediate binding mode on cloud providers¶
Your StorageClass uses volumeBindingMode: Immediate. The PVC is created and a disk is provisioned in us-east-1a. The pod gets scheduled to a node in us-east-1b. The volume cannot be attached — zone mismatch. The pod is stuck in Pending.
What happens: Pod cannot mount its volume because the disk and node are in different availability zones.
Why: Immediate provisions the disk before a pod is scheduled. The scheduler does not know which zone the disk is in.
How to avoid: Always use volumeBindingMode: WaitForFirstConsumer on cloud providers. This delays provisioning until the pod is scheduled, ensuring zone alignment.
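A zone-safe StorageClass sketch (the provisioner shown assumes the GCE Persistent Disk CSI driver; any topology-aware CSI driver works the same way):

```yaml
# Illustrative StorageClass: delay provisioning until a pod is scheduled,
# so the disk is created in the same zone as the chosen node.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-ssd
provisioner: pd.csi.storage.gke.io   # assumption: GCE PD CSI driver
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: pd-ssd
```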
3. Deleting a StatefulSet and assuming PVCs are cleaned up¶
You delete a StatefulSet with kubectl delete statefulset postgres. The pods are gone but all PVCs remain. You recreate the StatefulSet and the old data is still there — sometimes that is good, sometimes it is stale data from a different version.
What happens: Orphaned PVCs persist, consuming storage and potentially causing confusion on recreation.
Why: Kubernetes deliberately does not delete PVCs when a StatefulSet is deleted. This is a safety feature to prevent accidental data loss.
How to avoid: Audit PVCs after StatefulSet deletion. Delete them explicitly when you are sure the data is no longer needed: kubectl delete pvc -l app=postgres -n production.
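If you want Kubernetes to manage this for you, StatefulSets support a persistentVolumeClaimRetentionPolicy (beta and enabled by default since v1.27). A sketch, with the rest of the spec trimmed:

```yaml
# Sketch: declare what should happen to volumeClaimTemplates PVCs.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the StatefulSet is deleted
    whenScaled: Retain    # keep PVCs when scaling down (default)
  selector:
    matchLabels:
      app: postgres
  # template: and volumeClaimTemplates: omitted for brevity
```

whenDeleted: Delete is convenient for ephemeral environments, but for production databases the default Retain/Retain is the safer choice — it preserves exactly the safety behavior described above.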
4. No fsGroup in pod security context¶
Your container runs as UID 1000 but the volume is mounted with root ownership. The app cannot write to its data directory. It crashes with "permission denied."
What happens: Application fails to start or write data, even though the volume is mounted correctly.
Why: By default, mounted volumes have root ownership. The container process needs matching permissions.
How to avoid: Set securityContext.fsGroup in the pod spec to the GID the container process runs as. The kubelet then sets group ownership on the volume's contents at mount time — recursively by default, though fsGroupChangePolicy: OnRootMismatch skips the walk when the root of the volume already has the right ownership.
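A minimal pod sketch for a UID-1000 process (the PVC name is hypothetical):

```yaml
# Sketch: give a non-root container group write access to its volume.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000                        # volume contents get group 1000 on mount
    fsGroupChangePolicy: OnRootMismatch  # skip recursive chown if already correct
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "touch /data/ok && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: app-data    # hypothetical PVC name
```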
5. Expanding a PVC without checking StorageClass support¶
You patch a PVC to request more space but the StorageClass has allowVolumeExpansion: false. On most clusters the API server rejects the patch outright (the PersistentVolumeClaimResize admission plugin returns a forbidden error); if that plugin is disabled, the patch is accepted but nothing grows. Either way, the pod eventually runs out of disk space.
What happens: Expansion fails — at best with an immediate API error, at worst silently, with the PVC's requested size changing while the underlying volume stays the same.
Why: Volume expansion requires explicit opt-in via allowVolumeExpansion: true on the StorageClass.
How to avoid: Check the StorageClass before attempting expansion: kubectl get storageclass <name> -o jsonpath='{.allowVolumeExpansion}'.
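An expansion-capable StorageClass sketch (provisioner again assumes the AWS EBS CSI driver; the CSI driver itself must also support resize):

```yaml
# Sketch: StorageClass that permits online PVC expansion.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable
provisioner: ebs.csi.aws.com   # assumption: AWS EBS CSI driver
allowVolumeExpansion: true     # required opt-in for PVC resize
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

With this in place, editing the PVC's spec.resources.requests.storage upward triggers the resize; watch the PVC's status conditions (e.g., FileSystemResizePending) to confirm progress.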
6. Multi-attach error during node failure¶
A node goes down hard (no graceful drain). The RWO volume is still "attached" to the dead node. A replacement pod on another node cannot mount the volume. It sits in ContainerCreating with a multi-attach error for 6+ minutes.
What happens: Extended downtime waiting for the volume to detach from the failed node.
Why: The cloud provider's volume attachment has a timeout before it allows force-detach. Kubernetes waits for this timeout.
How to avoid: Tune how quickly pods are evicted from failed nodes. On current Kubernetes this is taint-based: set tolerationSeconds on the node.kubernetes.io/unreachable toleration in the pod spec (the legacy kube-controller-manager flag --pod-eviction-timeout is ignored once taint-based eviction is active). For critical workloads, consider shared storage (EFS, NFS) that supports RWX, so a replacement pod never needs to wait for a detach.
Debug clue: A multi-attach error message in kubectl describe pod means the volume is still attached to another node. Check kubectl get volumeattachment to see which node holds the attachment. On AWS, the default force-detach timeout for EBS is 6 minutes. You can speed this up by deleting the VolumeAttachment object, which triggers the CSI driver to force-detach: kubectl delete volumeattachment <name>.
7. Using emptyDir for data you need to keep¶
You use emptyDir for application logs or uploaded files. The pod restarts. Everything is gone. "But it worked in testing" — because the pod never restarted in testing.
What happens: Data loss on every pod restart, rescheduling, or node drain.
Why: emptyDir is tied to the pod lifecycle. When the pod is removed, the data is deleted.
How to avoid: Use PersistentVolumeClaims for any data that must survive pod restarts. Use emptyDir only for scratch space, caches, and truly ephemeral data.
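A pod sketch contrasting the two (image and PVC name are illustrative):

```yaml
# Sketch: ephemeral scratch space and durable data side by side.
apiVersion: v1
kind: Pod
metadata:
  name: uploader
spec:
  containers:
  - name: app
    image: nginx                 # illustrative image
    volumeMounts:
    - name: cache
      mountPath: /tmp/cache      # scratch space, wiped when the pod is removed
    - name: uploads
      mountPath: /srv/uploads    # survives restarts and rescheduling
  volumes:
  - name: cache
    emptyDir:
      sizeLimit: 1Gi             # cap scratch usage so it can't fill the node
  - name: uploads
    persistentVolumeClaim:
      claimName: uploads-pvc     # hypothetical PVC name
```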
8. Not monitoring PVC disk usage¶
Your database PVC is 100Gi. The database grows to 95Gi. No alerts fire. The next vacuum or WAL write fails with ENOSPC. The database crashes.
What happens: Out-of-disk-space crash with no warning.
Why: Kubernetes does not natively alert on PVC usage. The kubelet does collect per-volume stats and exposes them as metrics, but nothing acts on them out of the box.
How to avoid: Use Prometheus with kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes. Alert at 80% and 90%.
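A sketch of the 80% alert as a PrometheusRule (assumes the prometheus-operator CRDs are installed; otherwise put the same expression in a plain Prometheus rules file):

```yaml
# Sketch: alert when any PVC crosses 80% usage.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pvc-usage
spec:
  groups:
  - name: pvc.rules
    rules:
    - alert: PVCAlmostFull
      expr: |
        kubelet_volume_stats_used_bytes
          / kubelet_volume_stats_capacity_bytes > 0.80
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "PVC {{ $labels.persistentvolumeclaim }} is over 80% full"
```

Duplicate the rule at a 0.90 threshold with severity: critical for the second tier of alerting.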
9. Snapshot taken during active writes¶
You create a VolumeSnapshot while the database is actively writing. The snapshot captures a partially written transaction. Restoring from it produces a corrupt database.
What happens: Data corruption on restore from an inconsistent snapshot.
Why: Volume snapshots are crash-consistent, not application-consistent. In-flight writes may be partially captured.
How to avoid: Quiesce the application before snapshotting (e.g., pg_start_backup() for PostgreSQL, renamed pg_backup_start() in PostgreSQL 15). Or use the database's native backup tools instead of volume snapshots.
Gotcha: Even crash-consistent snapshots work for some databases (PostgreSQL, MySQL InnoDB) because they have WAL/redo logs that replay on startup — similar to recovering after a power failure. But this only works if the WAL is on the same volume. If your data and WAL are on separate PVCs, snapshotting them at different moments creates an unrecoverable state. Always snapshot all volumes atomically or use the database's native backup tool.
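For reference, a snapshot request through the CSI snapshot API looks like this sketch (the VolumeSnapshotClass and PVC names are hypothetical; quiesce first if you need application consistency):

```yaml
# Sketch: crash-consistent snapshot of one PVC via the CSI snapshot API.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical snapshot class
  source:
    persistentVolumeClaimName: data-postgres-0   # hypothetical PVC name
```

Each VolumeSnapshot targets a single PVC — which is exactly why the data-and-WAL-on-separate-PVCs case above cannot be made consistent with independent snapshot objects taken at different moments.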