
Kubernetes Storage - Street-Level Ops

Real-world workflows for provisioning, debugging, and managing persistent storage in production.

Check Storage State

# List all PVCs across namespaces with status
kubectl get pvc -A --sort-by=.spec.resources.requests.storage
# NAMESPACE    NAME               STATUS   VOLUME       CAPACITY   ACCESS   STORAGECLASS   AGE
# production   data-postgres-0    Bound    pvc-abc123   100Gi      RWO      fast-ssd       60d
# production   data-postgres-1    Bound    pvc-def456   100Gi      RWO      fast-ssd       60d

# List PVs with reclaim policy
kubectl get pv --sort-by=.spec.capacity.storage
# NAME         CAPACITY   ACCESS   RECLAIM POLICY   STATUS   STORAGECLASS
# pvc-abc123   100Gi      RWO      Retain           Bound    fast-ssd

# Check available StorageClasses
kubectl get storageclass
# NAME                 PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
# fast-ssd (default)   ebs.csi.aws.com        Delete          WaitForFirstConsumer   true
# standard             kubernetes.io/gce-pd   Delete          Immediate              false

Remember the PV lifecycle mnemonic A-B-R-F: Available, Bound, Released, Failed. A PVC binds to a PV; when the PVC is deleted, the PV goes to Released (Retain policy) or is deleted along with its disk (Delete policy).
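The lifecycle above can be summarized cluster-wide with a small `jq` filter over `kubectl get pv -o json`. A sketch, assuming `jq` is installed; the helper name is ours, not a kubectl feature:

```shell
# Hypothetical helper: count PVs by lifecycle phase from
# `kubectl get pv -o json` on stdin. A growing Released count usually
# means Retain-policy PVs are piling up after PVC deletions.
pv_phase_summary() {
  jq -r '.items
         | group_by(.status.phase)
         | map("\(.[0].status.phase)\t\(length)")
         | .[]'
}

# Usage on a live cluster:
# kubectl get pv -o json | pv_phase_summary
```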

Debug PVC Stuck in Pending

# Check PVC events for the cause
kubectl describe pvc data-postgres-0 -n production
# Events:
#   Warning  ProvisioningFailed  2m  ebs.csi.aws.com  failed to provision volume: zone mismatch

# Common causes and checks:

# 1. StorageClass does not exist
kubectl get storageclass | grep fast-ssd

# 2. CSI driver not running
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl get csidrivers

# 3. WaitForFirstConsumer — normal until a pod is scheduled
kubectl get pvc data-postgres-0 -n production -o jsonpath='{.status.phase}'
# Pending — check if any pod references this PVC

# 4. Quota exceeded
kubectl get resourcequota -n production

> **Gotcha:** `WaitForFirstConsumer` PVCs stay `Pending` until a pod referencing them is scheduled. This is normal and intentional: it ensures the volume is created in the same availability zone as the pod. Before you panic and start debugging, verify a pod is actually trying to mount the PVC.
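One way to do that verification is to filter the pod list for volumes that reference the claim. A sketch using `jq` (the helper name is ours, and `jq` is assumed to be installed):

```shell
# Hypothetical helper: given `kubectl get pods -o json` on stdin,
# print the names of pods whose volumes reference the named PVC.
pods_using_pvc() {
  jq -r --arg pvc "$1" '
    .items[]
    | select(any(.spec.volumes[]?;
        .persistentVolumeClaim.claimName == $pvc))
    | .metadata.name'
}

# Usage on a live cluster:
# kubectl get pods -n production -o json | pods_using_pvc data-postgres-0
```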

# 5. Zone mismatch (cloud disks are zone-local)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
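To compare those node zones against a specific PV, you can pull the zone the volume is pinned to out of its `nodeAffinity`. A `jq` sketch (helper name is illustrative; assumes a CSI provisioner that sets the standard topology label):

```shell
# Hypothetical helper: given `kubectl get pv <name> -o json` on stdin,
# print the zone(s) the PV is pinned to via .spec.nodeAffinity
# (CSI provisioners set the topology.kubernetes.io/zone key).
pv_zone() {
  jq -r '.spec.nodeAffinity.required.nodeSelectorTerms[]?
         | .matchExpressions[]?
         | select(.key == "topology.kubernetes.io/zone")
         | .values[]'
}

# Usage: compare with the node zones listed above
# kubectl get pv pvc-abc123 -o json | pv_zone
```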

Debug Mount Errors

# Check pod events for mount failures
kubectl describe pod postgres-0 -n production | grep -A5 "Warning"
# Warning  FailedAttachVolume  attachdetach-controller  Multi-Attach error for volume "pvc-abc123"

# Multi-attach: volume still attached to another node
# Check which node the volume is attached to
kubectl get volumeattachments | grep pvc-abc123

# Force-detach a stuck volume (last resort — may corrupt data if I/O is in-flight)
kubectl delete volumeattachment <attachment-name>

# Permission denied on mount: check fsGroup
kubectl get pod postgres-0 -n production -o jsonpath='{.spec.securityContext.fsGroup}'
# Fix: add fsGroup to pod security context matching the container user
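A sketch of the securityContext stanza that fix implies; the GID here is an assumption, so match it to your image's runtime group:

```yaml
# Sketch: have the kubelet chown volume contents to the container's group.
# fsGroup: 999 is an assumption — use the GID your container actually runs as.
spec:
  securityContext:
    fsGroup: 999
    fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown when ownership is already correct
```

`OnRootMismatch` avoids a slow recursive chown on large volumes when the root directory already has the right ownership.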

Expand a PVC

# Check if the StorageClass allows expansion
kubectl get storageclass fast-ssd -o jsonpath='{.allowVolumeExpansion}'
# true

# Expand the PVC (online for most CSI drivers)
kubectl patch pvc data-postgres-0 -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Monitor expansion progress
kubectl get pvc data-postgres-0 -n production -o jsonpath='{.status.conditions}'
# May require pod restart for filesystem resize
kubectl delete pod postgres-0 -n production
# StatefulSet recreates the pod, which triggers filesystem resize on mount

Under the hood: Online expansion has two phases: the cloud provider grows the block device, then the kubelet runs `resize2fs` (ext4) or `xfs_growfs` (XFS) on the next mount. If the PVC shows the new size but `df` inside the pod still shows the old size, the filesystem resize has not happened yet — delete the pod to trigger a remount.
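A quick way to spot that in-flight state from the API side is to compare the PVC's requested size with its reported capacity. A `jq` sketch (helper name is ours; it does a plain string compare, which works because both fields normally use the same quantity notation):

```shell
# Hypothetical helper: given `kubectl get pvc <name> -o json` on stdin,
# report whether the spec request and the status capacity disagree —
# they do while a resize is still in flight.
pvc_resize_pending() {
  jq -r 'if .status.capacity.storage != .spec.resources.requests.storage
         then "resize pending: \(.status.capacity.storage) -> \(.spec.resources.requests.storage)"
         else "in sync at \(.status.capacity.storage)"
         end'
}

# kubectl get pvc data-postgres-0 -n production -o json | pvc_resize_pending
```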

Default trap: Most cloud StorageClasses default to `reclaimPolicy: Delete`. If you delete a PVC backed by a `Delete` PV, the underlying cloud disk is destroyed. For databases, always create a StorageClass with `reclaimPolicy: Retain` so the data survives PVC deletion.
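A sketch of such a class; the provisioner and `type: gp3` parameter assume the AWS EBS CSI driver, so swap them for your cloud:

```yaml
# Sketch: Retain-policy StorageClass for database volumes.
# Provisioner and parameters assume AWS EBS CSI — adjust for your cloud.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain                    # PV (and disk) survives PVC deletion
volumeBindingMode: WaitForFirstConsumer  # provision in the pod's zone
allowVolumeExpansion: true
```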

Volume Snapshots

```bash
# Create a snapshot before a risky operation
# (unquoted EOF so $(date +%Y%m%d) expands in the snapshot name)
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-snap-$(date +%Y%m%d)
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: data-postgres-0
EOF

# Check snapshot status
kubectl get volumesnapshot -n production
# NAME               READYTOUSE   RESTORESIZE   AGE
# pg-snap-20260315   true         100Gi         2m

# Restore from snapshot to a new PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-restored
  namespace: production
spec:
  dataSource:
    name: pg-snap-20260315
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
EOF
```

StatefulSet Storage Operations

```bash

# List PVCs for a StatefulSet (naming pattern: