Kubernetes Storage - Primer

Why This Matters

Stateless pods are easy. The moment your workload needs to persist data — databases, message queues, file uploads, ML model checkpoints — you enter Kubernetes storage. Get it wrong and you face data loss, stuck deployments, or pods that refuse to schedule. Storage is where Kubernetes stops being abstract and starts touching real disks, real drivers, and real failure modes.

Core Concepts

PersistentVolumes (PV)

A PersistentVolume is a cluster-level resource representing a piece of storage. It exists independently of any pod. Think of it as the "disk" that the cluster knows about:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-fast-01
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: fast-ssd
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0abc123def456

PVs can be provisioned statically (admin creates them ahead of time) or dynamically (created on demand via StorageClasses).

PersistentVolumeClaims (PVC)

A PVC is a namespace-scoped request for storage. Pods never reference PVs directly — they reference PVCs, and Kubernetes binds PVCs to suitable PVs:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-0
  namespace: production
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi

The binding is based on access mode, storage class, capacity, and label selectors. Once bound, the relationship is exclusive — no other PVC can claim that PV.
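Label selectors give finer control when binding against statically provisioned PVs. A minimal sketch, assuming an admin has pre-created PVs carrying a tier: gold label (the label name is illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-reporting
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  selector:
    matchLabels:
      tier: gold        # only PVs carrying this label are binding candidates
  resources:
    requests:
      storage: 50Gi
```

Note that a PVC with a non-empty selector cannot be dynamically provisioned; selectors only apply to statically created PVs.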

StorageClasses

A StorageClass defines how storage is provisioned. It names a provisioner, sets parameters, and configures reclaim behavior:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

| Field | Purpose |
| --- | --- |
| provisioner | Which CSI driver or in-tree plugin creates volumes |
| parameters | Provider-specific knobs (disk type, IOPS, encryption) |
| reclaimPolicy | What happens to the PV when the PVC is deleted |
| volumeBindingMode | Immediate or WaitForFirstConsumer |
| allowVolumeExpansion | Whether PVCs can request more space after creation |

Dynamic Provisioning

With a StorageClass in place, you never create PVs manually. The flow:

  1. Pod references a PVC
  2. PVC references a StorageClass
  3. Kubernetes calls the CSI driver to create a volume
  4. A PV is automatically created and bound to the PVC
  5. The volume is mounted into the pod

WaitForFirstConsumer delays provisioning until a pod actually needs the volume. This ensures the volume is created in the same availability zone as the node — critical for cloud providers where disks are zone-local.

Access Modes

| Mode | Abbreviation | Description |
| --- | --- | --- |
| ReadWriteOnce | RWO | Mounted read-write by a single node |
| ReadOnlyMany | ROX | Mounted read-only by many nodes |
| ReadWriteMany | RWX | Mounted read-write by many nodes |
| ReadWriteOncePod | RWOP | Mounted read-write by a single pod (beta in K8s 1.27, GA in 1.29) |

RWO is the most common — block storage (EBS, Azure Disk, GCE PD) is inherently single-attach. RWX requires a shared filesystem (NFS, EFS, CephFS, Azure Files). ROX is useful for distributing config or reference data to many pods.
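For example, a claim for a shared upload directory backed by AWS EFS might look like the following sketch (the efs-shared class name is illustrative and assumes the EFS CSI driver is installed):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
spec:
  accessModes:
    - ReadWriteMany        # requires a shared-filesystem backend such as EFS
  storageClassName: efs-shared
  resources:
    requests:
      storage: 5Gi         # EFS ignores the size, but the API requires the field
```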

Common mistake: assuming RWO means single-pod. RWO means single-node — multiple pods on the same node can all mount an RWO volume.

Reclaim Policies

| Policy | Behavior |
| --- | --- |
| Retain | PV persists after PVC deletion; admin must manually reclaim |
| Delete | PV and underlying storage are deleted when the PVC is deleted |
| Recycle | Deprecated; ran rm -rf /thevolume/* and made the PV available again |

For production databases, use Retain. For ephemeral workloads, Delete is fine. Never rely on Recycle.

Under the hood: The Container Storage Interface (CSI) specification was created in 2017 as a collaboration across container-orchestrator communities (Kubernetes, Mesos, Cloud Foundry, and others) to decouple storage from the Kubernetes codebase. Before CSI, storage vendors had to submit code directly to the Kubernetes repository (so-called "in-tree" plugins), which slowed both Kubernetes releases and vendor iteration. CSI moved storage drivers "out-of-tree": vendors ship their own container images that implement a gRPC interface. CSI reached GA in Kubernetes 1.13 (December 2018). The old in-tree plugins (like kubernetes.io/aws-ebs) have been migrated to CSI drivers, and several have already been removed in recent releases.

CSI Drivers

The Container Storage Interface (CSI) is the standard for connecting storage systems to Kubernetes. Each cloud or storage vendor ships a CSI driver:

| Provider | Driver |
| --- | --- |
| AWS EBS | ebs.csi.aws.com |
| AWS EFS | efs.csi.aws.com |
| GCE PD | pd.csi.storage.gke.io |
| Azure Disk | disk.csi.azure.com |
| Azure Files | file.csi.azure.com |
| Ceph RBD | rbd.csi.ceph.com |
| NFS | nfs.csi.k8s.io |

CSI drivers run as DaemonSets (node plugin) and Deployments (controller). Check driver health:

kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl get csinodes
kubectl get csidrivers

Volume Snapshots

Volume snapshots allow point-in-time copies of PVCs, useful for backups and cloning:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-data-snap-20260315
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: data-postgres-0

Restore from a snapshot by referencing it in a new PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-restored
spec:
  dataSource:
    name: pg-data-snap-20260315
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 50Gi

Volume snapshots require a CSI driver that supports them and a VolumeSnapshotClass.
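A VolumeSnapshotClass for the examples above might look like this sketch, assuming the AWS EBS CSI driver (the class name mirrors csi-aws-snapclass used earlier):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-snapclass
driver: ebs.csi.aws.com     # must match the CSI driver backing the PVC
deletionPolicy: Delete      # remove the backing snapshot when the VolumeSnapshot object is deleted
```

Set deletionPolicy: Retain instead if snapshots must survive accidental deletion of the VolumeSnapshot object.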

StatefulSet Storage Patterns

StatefulSets use volumeClaimTemplates to create one PVC per replica. Each PVC follows the naming pattern <template-name>-<statefulset-name>-<ordinal>:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  replicas: 3
  serviceName: postgres
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: pgdata
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 100Gi
This creates pgdata-postgres-0, pgdata-postgres-1, pgdata-postgres-2. When a StatefulSet pod is rescheduled, it reattaches to its existing PVC — this is how databases survive pod restarts.

Key behaviors:

  - Scaling down does NOT delete PVCs (data is preserved)
  - Deleting the StatefulSet does NOT delete PVCs
  - PVCs must be deleted manually to reclaim storage
  - Scaling back up reattaches existing PVCs by ordinal
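Newer clusters can automate this cleanup: StatefulSets accept a persistentVolumeClaimRetentionPolicy (beta since Kubernetes 1.27) that controls whether PVCs survive deletion or scale-down. A sketch of the relevant stanza:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the StatefulSet itself is deleted
    whenScaled: Retain    # keep PVCs on scale-down so data survives a later scale-up
  # ... remainder of the spec as above
```

On older clusters (or with the default Retain/Retain policy), manual PVC cleanup remains necessary.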

Gotcha: WaitForFirstConsumer is almost always the right choice, but it causes confusion during debugging. A PVC in WaitForFirstConsumer mode will stay in Pending state indefinitely until a pod actually references it and gets scheduled. This is normal behavior, not an error. The PVC events will say "waiting for first consumer to be created before binding." If you see this and there is no pod referencing the PVC, it is working as designed — the volume will not be provisioned until a pod needs it.

War story: A team deleted a StatefulSet and assumed the PVCs would be cleaned up automatically. They were not — Kubernetes deliberately preserves StatefulSet PVCs to prevent data loss. Three months later, they had 200 orphaned 100Gi PVCs costing $2,000/month on AWS EBS. The fix: audit PVCs with kubectl get pvc -A --sort-by=.metadata.creationTimestamp and delete orphans. The prevention: tag PVCs with ownership labels and run periodic cleanup scripts.

Debugging Storage Issues

PVC Stuck in Pending

This is the most common storage problem. Check the PVC events:

kubectl describe pvc <name> -n <namespace>

Common causes:

| Symptom in Events | Cause | Fix |
| --- | --- | --- |
| no persistent volumes available | No matching PV exists and no StorageClass can provision one | Check the StorageClass exists and the provisioner is healthy |
| waiting for first consumer | WaitForFirstConsumer mode; normal until a pod is scheduled | Schedule a pod that references this PVC |
| storageclass "xxx" not found | PVC references a nonexistent StorageClass | Create the StorageClass or fix the name |
| exceeded quota | Namespace ResourceQuota limits storage | Increase the quota or reduce the request |
| volume capacity insufficient | Requested size exceeds what the provisioner can allocate | Reduce the size or check provider limits |

Mount Errors

kubectl describe pod <name> -n <namespace>
# Look for "Unable to attach" or "MountVolume" warnings

kubectl get events -n <namespace> --field-selector reason=FailedMount

Common mount failures:

  - Multi-attach error: RWO volume still attached to another node (node drain did not complete)
  - Wrong filesystem: volume formatted as ext4 but the pod expects xfs
  - Permission denied: the securityContext UID does not match filesystem ownership; set fsGroup in the pod spec
  - Volume not found: the underlying cloud disk was deleted outside Kubernetes
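The fsGroup fix above can be sketched as a pod-level securityContext; the GID 999 is illustrative and should match the group your container process runs as:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-debug
spec:
  securityContext:
    fsGroup: 999           # kubelet chowns mounted volume contents to this group
  containers:
    - name: postgres
      image: postgres:16
      volumeMounts:
        - name: pgdata
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: pgdata
      persistentVolumeClaim:
        claimName: data-postgres-0
```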

Capacity Issues

# Check PV/PVC utilization
kubectl get pv --sort-by=.spec.capacity.storage
kubectl get pvc -A --sort-by=.spec.resources.requests.storage

# Check node disk pressure
kubectl describe node <name> | grep -A5 Conditions

# Expand a PVC (if StorageClass allows it)
kubectl patch pvc data-postgres-0 -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

PVC expansion is online for most CSI drivers but may require pod restart for filesystem resize. Check allowVolumeExpansion: true on the StorageClass.

Best Practices

  1. Always use StorageClasses — avoid static PV provisioning in production
  2. Use WaitForFirstConsumer — prevents zone mismatch on cloud providers
  3. Set Retain for databases; Delete is fine for caches and temp data
  4. Size PVCs generously — expansion is possible but adds operational risk
  5. Use fsGroup in pod security context — ensures the container process owns mounted files
  6. Snapshot before upgrades — take VolumeSnapshots before database version changes
  7. Monitor PVC usage — alert on disk usage > 80% before pods hit ENOSPC
  8. Label your PVs and PVCs — makes cleanup and auditing manageable at scale
  9. Test failover — drain a node and verify pods reattach to storage on another node
  10. Document your StorageClasses — teams should know which class to use for which workload
