# Kubernetes Storage - Street-Level Ops

Real-world workflows for provisioning, debugging, and managing persistent storage in production.
## Check Storage State

```bash
# List all PVCs across namespaces, sorted by requested size
kubectl get pvc -A --sort-by=.spec.resources.requests.storage
# NAMESPACE    NAME              STATUS   VOLUME       CAPACITY   ACCESS   STORAGECLASS   AGE
# production   data-postgres-0   Bound    pvc-abc123   100Gi      RWO      fast-ssd       60d
# production   data-postgres-1   Bound    pvc-def456   100Gi      RWO      fast-ssd       60d

# List PVs with reclaim policy
kubectl get pv --sort-by=.spec.capacity.storage
# NAME         CAPACITY   ACCESS   RECLAIM POLICY   STATUS   STORAGECLASS
# pvc-abc123   100Gi      RWO      Retain           Bound    fast-ssd

# Check available StorageClasses
kubectl get storageclass
# NAME                 PROVISIONER            RECLAIM   VOLUMEBINDINGMODE      ALLOWEXPANSION
# fast-ssd (default)   ebs.csi.aws.com        Delete    WaitForFirstConsumer   true
# standard             kubernetes.io/gce-pd   Delete    Immediate              false
```
Remember the PV lifecycle mnemonic A-B-R-F: Available, Bound, Released, Failed. A PVC binds to a PV; when the PVC is deleted, the PV moves to Released (Retain policy) or is deleted along with its disk (Delete policy).
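The lifecycle above starts from a claim. For reference, a minimal PVC manifest that would request storage through the `fast-ssd` class shown earlier (name and size are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-scratch
  namespace: production
spec:
  accessModes: ["ReadWriteOnce"]   # RWO: mountable read-write by a single node
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 10Gi
```

Once applied, the claim sits in `Pending` (normal under `WaitForFirstConsumer`) or `Bound`, and deleting it triggers the reclaim policy of whichever PV it bound to.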
## Debug PVC Stuck in Pending

```bash
# Check PVC events for the cause
kubectl describe pvc data-postgres-0 -n production
# Events:
#   Warning  ProvisioningFailed  2m  ebs.csi.aws.com  failed to provision volume: zone mismatch

# Common causes and checks:

# 1. StorageClass does not exist
kubectl get storageclass | grep fast-ssd

# 2. CSI driver not running
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl get csidrivers

# 3. WaitForFirstConsumer: normal until a pod is scheduled
kubectl get pvc data-postgres-0 -n production -o jsonpath='{.status.phase}'
# Pending: check whether any pod references this PVC

# 4. Quota exceeded
kubectl get resourcequota -n production

# 5. Zone mismatch (cloud disks are zone-local)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
```

> **Gotcha:** `WaitForFirstConsumer` PVCs stay `Pending` until a pod referencing them is scheduled. This is normal and intentional: it ensures the volume is created in the same availability zone as the pod. Don't start debugging until you have verified that a pod is actually trying to mount it.
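When the zone-mismatch error from cause 5 appears, the usual fix is delayed binding, so the disk is provisioned in whichever zone the scheduler picks for the pod. A sketch of such a StorageClass, assuming the AWS EBS CSI driver used elsewhere on this page (the `gp3` parameter is illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3                               # illustrative; any zone-local disk type has this constraint
volumeBindingMode: WaitForFirstConsumer   # provision only after the pod is placed
reclaimPolicy: Delete
allowVolumeExpansion: true
```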
## Debug Mount Errors

```bash
# Check pod events for mount failures
kubectl describe pod postgres-0 -n production | grep -A5 "Warning"
#   Warning  FailedAttachVolume  attachdetach-controller  Multi-Attach error for volume "pvc-abc123"

# Multi-attach: the volume is still attached to another node
# Check which node the volume is attached to
kubectl get volumeattachments | grep pvc-abc123

# Force-detach a stuck volume (last resort: may corrupt data if I/O is in flight)
kubectl delete volumeattachment <attachment-name>

# Permission denied on mount: check fsGroup
kubectl get pod postgres-0 -n production -o jsonpath='{.spec.securityContext.fsGroup}'
# Fix: add an fsGroup to the pod security context matching the container user
```
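A sketch of the `fsGroup` fix, assuming the container runs as GID 999 (the default in the official `postgres` image; adjust to your image's user):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: postgres-0
spec:
  securityContext:
    fsGroup: 999                          # volume files get group 999 on mount
    fsGroupChangePolicy: OnRootMismatch   # skip the recursive chown when ownership is already correct
  containers:
  - name: postgres
    image: postgres:16
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-postgres-0
```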
## Expand a PVC

```bash
# Check whether the StorageClass allows expansion
kubectl get storageclass fast-ssd -o jsonpath='{.allowVolumeExpansion}'
# true

# Expand the PVC (online for most CSI drivers)
kubectl patch pvc data-postgres-0 -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Monitor expansion progress
kubectl get pvc data-postgres-0 -n production -o jsonpath='{.status.conditions}'

# The filesystem resize may require a pod restart
kubectl delete pod postgres-0 -n production
# The StatefulSet recreates the pod, which triggers the filesystem resize on mount
```
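To decide whether that restart is actually needed, check the conditions output for a pending filesystem resize. A tiny helper (illustrative, not a kubectl feature) that inspects the JSON printed by the jsonpath query above:

```shell
#!/usr/bin/env bash
# resize_pending CONDITIONS_JSON
# Succeeds if the PVC conditions report FileSystemResizePending, meaning the
# block device has grown but the filesystem has not been resized yet.
resize_pending() {
  printf '%s' "$1" | grep -q '"type":"FileSystemResizePending"'
}

# Example with a captured conditions array
conditions='[{"type":"FileSystemResizePending","status":"True"}]'
if resize_pending "$conditions"; then
  echo "filesystem resize pending - restart the pod to remount"
fi
```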
Under the hood: online expansion has two phases. First the cloud provider grows the block device, then the kubelet runs `resize2fs` (ext4) or `xfs_growfs` (XFS) on the next mount. If the PVC shows the new size but `df` inside the pod still shows the old size, the filesystem resize has not happened yet; delete the pod to trigger a remount.

> **Default trap:** Most cloud StorageClasses default to `reclaimPolicy: Delete`. If you delete a PVC backed by a `Delete` PV, the underlying cloud disk is destroyed. For databases, always create a StorageClass with `reclaimPolicy: Retain` so the data survives PVC deletion.
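A minimal sketch of such a Retain-policy class (the class name is illustrative; the provisioner assumes the AWS EBS CSI driver used elsewhere on this page):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
provisioner: ebs.csi.aws.com
reclaimPolicy: Retain                     # PV and cloud disk survive PVC deletion
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```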
## Volume Snapshots

```bash
# Create a snapshot before a risky operation
# (unquoted EOF so $(date ...) expands in the snapshot name)
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-snap-$(date +%Y%m%d)
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: data-postgres-0
EOF

# Check snapshot status
kubectl get volumesnapshot -n production
# NAME               READYTOUSE   RESTORESIZE   AGE
# pg-snap-20260315   true         100Gi         2m

# Restore from snapshot to a new PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-restored
  namespace: production
spec:
  dataSource:
    name: pg-snap-20260315
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
EOF
```
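The `csi-aws-snapclass` referenced above is a VolumeSnapshotClass, which must exist before snapshots can be created. A minimal sketch, assuming the same EBS CSI driver (name and policy are illustrative):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-snapclass
driver: ebs.csi.aws.com
deletionPolicy: Delete   # snapshot content is removed when the VolumeSnapshot object is deleted
```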
## StatefulSet Storage Operations

```bash
# List PVCs for a StatefulSet (naming pattern: <template>-<statefulset>-<ordinal>)
kubectl get pvc -n production -l app=postgres
# data-postgres-0, data-postgres-1, data-postgres-2

# Scaling down does NOT delete PVCs
kubectl scale statefulset postgres -n production --replicas=1
kubectl get pvc -n production -l app=postgres
# All 3 PVCs still exist - data preserved

# Scaling back up reattaches the existing PVCs
kubectl scale statefulset postgres -n production --replicas=3

# Delete orphaned PVCs manually after decommissioning
kubectl delete pvc data-postgres-2 -n production
```
## Check Disk Usage Inside Pods

```bash
# Check filesystem usage inside the container
kubectl exec -it postgres-0 -n production -- df -h /var/lib/postgresql/data
# Filesystem      Size  Used  Avail  Use%
# /dev/nvme1n1    100G  72G   28G    72%

# Find large files
kubectl exec -it postgres-0 -n production -- du -sh /var/lib/postgresql/data/*

# Check node disk pressure
kubectl describe node worker-01 | grep -A5 "Conditions"
#   MemoryPressure   False
#   DiskPressure     False
```
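To wire the `df` output into a simple alert, a plain-bash helper (illustrative, not part of kubectl) that compares a `Use%` value against a threshold:

```shell
#!/usr/bin/env bash
# disk_alert USE_PERCENT THRESHOLD
# Prints ALERT when usage (e.g. "72%") meets or exceeds the threshold.
disk_alert() {
  local pct=${1%\%}   # strip the trailing % from df-style output
  if [ "$pct" -ge "$2" ]; then
    echo "ALERT: ${pct}% used"
  else
    echo "OK: ${pct}% used"
  fi
}

disk_alert 72% 80   # below threshold
disk_alert 91% 80   # above threshold
```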
## Reclaim a Retained PV

```bash
# After PVC deletion, a Retain PV moves to Released status
kubectl get pv pvc-abc123
# STATUS: Released

# To reuse it: remove the claimRef so it can bind to a new PVC
kubectl patch pv pvc-abc123 -p '{"spec":{"claimRef":null}}'
# The PV is now Available and can be claimed by a matching PVC
```
War story: A team deleted a StatefulSet to redeploy it, expecting the PVCs to persist. They did persist, but the new StatefulSet used a different `volumeClaimTemplates` name. Kubernetes created brand-new PVCs instead of reattaching the old ones, and the old data sat orphaned on the original, unused claims. Always verify the PVC naming pattern (`<template>-<sts>-<ordinal>`) matches before redeploying.
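A quick way to sanity-check the expected names before redeploying is to generate them and compare against `kubectl get pvc`. A plain-bash sketch (the helper name is illustrative):

```shell
#!/usr/bin/env bash
# expected_pvcs TEMPLATE STATEFULSET REPLICAS
# Prints the PVC names a StatefulSet will look for: <template>-<sts>-<ordinal>
expected_pvcs() {
  local template=$1 sts=$2 replicas=$3
  for ((i = 0; i < replicas; i++)); do
    echo "${template}-${sts}-${i}"
  done
}

expected_pvcs data postgres 3
# data-postgres-0
# data-postgres-1
# data-postgres-2
```

Diff this output against the live cluster before applying the new manifest; any mismatch means the StatefulSet will provision fresh, empty volumes.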
## Quick Reference

- Runbook: PVC Stuck in Pending