
Kubernetes Storage - Street-Level Ops

Real-world workflows for provisioning, debugging, and managing persistent storage in production.

Check Storage State

# List all PVCs across namespaces with status
kubectl get pvc -A --sort-by=.spec.resources.requests.storage
# NAMESPACE    NAME               STATUS   VOLUME       CAPACITY   ACCESS   STORAGECLASS   AGE
# production   data-postgres-0    Bound    pvc-abc123   100Gi      RWO      fast-ssd       60d
# production   data-postgres-1    Bound    pvc-def456   100Gi      RWO      fast-ssd       60d

# List PVs with reclaim policy
kubectl get pv --sort-by=.spec.capacity.storage
# NAME         CAPACITY   ACCESS   RECLAIM POLICY   STATUS   STORAGECLASS
# pvc-abc123   100Gi      RWO      Retain           Bound    fast-ssd

# Check available StorageClasses
kubectl get storageclass
# NAME                 PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
# fast-ssd (default)   ebs.csi.aws.com        Delete          WaitForFirstConsumer   true
# standard             kubernetes.io/gce-pd   Delete          Immediate              false

Remember the PV lifecycle mnemonic A-B-R-F: Available, Bound, Released, Failed. A PVC binds to a PV; when the PVC is deleted, the PV goes to Released (Retain policy) or is deleted along with its disk (Delete policy).
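The lifecycle above can be summarized cluster-wide with a small `jq` filter over `kubectl get pv -o json`. A sketch, assuming `jq` is installed; the helper name is ours, not a kubectl feature:

```shell
# Hypothetical helper: count PVs by lifecycle phase from
# `kubectl get pv -o json` on stdin. A growing Released count usually
# means Retain-policy PVs are piling up after PVC deletions.
pv_phase_summary() {
  jq -r '.items
         | group_by(.status.phase)
         | map("\(.[0].status.phase)\t\(length)")
         | .[]'
}

# Usage on a live cluster:
# kubectl get pv -o json | pv_phase_summary
```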

Debug PVC Stuck in Pending

# Check PVC events for the cause
kubectl describe pvc data-postgres-0 -n production
# Events:
#   Warning  ProvisioningFailed  2m  ebs.csi.aws.com  failed to provision volume: zone mismatch

# Common causes and checks:

# 1. StorageClass does not exist
kubectl get storageclass | grep fast-ssd

# 2. CSI driver not running
kubectl get pods -n kube-system -l app=ebs-csi-controller
kubectl get csidrivers

# 3. WaitForFirstConsumer — normal until a pod is scheduled
kubectl get pvc data-postgres-0 -n production -o jsonpath='{.status.phase}'
# Pending — check if any pod references this PVC

# 4. Quota exceeded
kubectl get resourcequota -n production

> **Gotcha:** `WaitForFirstConsumer` PVCs stay `Pending` until a pod referencing them is scheduled. This is normal and intentional: it ensures the volume is created in the same availability zone as the pod. Before you panic and start debugging, verify a pod is actually trying to mount the PVC.
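One way to do that verification is to filter the pod list for volumes that reference the claim. A sketch using `jq` (the helper name is ours, and `jq` is assumed to be installed):

```shell
# Hypothetical helper: given `kubectl get pods -o json` on stdin,
# print the names of pods whose volumes reference the named PVC.
pods_using_pvc() {
  jq -r --arg pvc "$1" '
    .items[]
    | select(any(.spec.volumes[]?;
        .persistentVolumeClaim.claimName == $pvc))
    | .metadata.name'
}

# Usage on a live cluster:
# kubectl get pods -n production -o json | pods_using_pvc data-postgres-0
```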

# 5. Zone mismatch (cloud disks are zone-local)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}'
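To compare those node zones against a specific PV, you can pull the zone the volume is pinned to out of its `nodeAffinity`. A `jq` sketch (helper name is illustrative; assumes a CSI provisioner that sets the standard topology label):

```shell
# Hypothetical helper: given `kubectl get pv <name> -o json` on stdin,
# print the zone(s) the PV is pinned to via .spec.nodeAffinity
# (CSI provisioners set the topology.kubernetes.io/zone key).
pv_zone() {
  jq -r '.spec.nodeAffinity.required.nodeSelectorTerms[]?
         | .matchExpressions[]?
         | select(.key == "topology.kubernetes.io/zone")
         | .values[]'
}

# Usage: compare with the node zones listed above
# kubectl get pv pvc-abc123 -o json | pv_zone
```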

Debug Mount Errors

# Check pod events for mount failures
kubectl describe pod postgres-0 -n production | grep -A5 "Warning"
# Warning  FailedAttachVolume  attachdetach-controller  Multi-Attach error for volume "pvc-abc123"

# Multi-attach: volume still attached to another node
# Check which node the volume is attached to
kubectl get volumeattachments | grep pvc-abc123

# Force-detach a stuck volume (last resort — may corrupt data if I/O is in-flight)
kubectl delete volumeattachment <attachment-name>

# Permission denied on mount: check fsGroup
kubectl get pod postgres-0 -n production -o jsonpath='{.spec.securityContext.fsGroup}'
# Fix: add fsGroup to pod security context matching the container user
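A sketch of the securityContext stanza that fix implies; the GID here is an assumption, so match it to your image's runtime group:

```yaml
# Sketch: have the kubelet chown volume contents to the container's group.
# fsGroup: 999 is an assumption — use the GID your container actually runs as.
spec:
  securityContext:
    fsGroup: 999
    fsGroupChangePolicy: "OnRootMismatch"  # skip the recursive chown when ownership is already correct
```

`OnRootMismatch` avoids a slow recursive chown on large volumes when the root directory already has the right ownership.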

Expand a PVC

# Check if the StorageClass allows expansion
kubectl get storageclass fast-ssd -o jsonpath='{.allowVolumeExpansion}'
# true

# Expand the PVC (online for most CSI drivers)
kubectl patch pvc data-postgres-0 -n production \
  -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# Monitor expansion progress
kubectl get pvc data-postgres-0 -n production -o jsonpath='{.status.conditions}'
# May require pod restart for filesystem resize
kubectl delete pod postgres-0 -n production
# StatefulSet recreates the pod, which triggers filesystem resize on mount

Under the hood: Online expansion has two phases: the cloud provider grows the block device, then the kubelet runs `resize2fs` (ext4) or `xfs_growfs` (XFS) on the next mount. If the PVC shows the new size but `df` inside the pod still shows the old size, the filesystem resize has not happened yet — delete the pod to trigger a remount.
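A quick way to spot that in-flight state from the API side is to compare the PVC's requested size with its reported capacity. A `jq` sketch (helper name is ours; it does a plain string compare, which works because both fields normally use the same quantity notation):

```shell
# Hypothetical helper: given `kubectl get pvc <name> -o json` on stdin,
# report whether the spec request and the status capacity disagree —
# they do while a resize is still in flight.
pvc_resize_pending() {
  jq -r 'if .status.capacity.storage != .spec.resources.requests.storage
         then "resize pending: \(.status.capacity.storage) -> \(.spec.resources.requests.storage)"
         else "in sync at \(.status.capacity.storage)"
         end'
}

# kubectl get pvc data-postgres-0 -n production -o json | pvc_resize_pending
```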

Default trap: Most cloud StorageClasses default to `reclaimPolicy: Delete`. If you delete a PVC backed by a `Delete` PV, the underlying cloud disk is destroyed. For databases, always create a StorageClass with `reclaimPolicy: Retain` so the data survives PVC deletion.
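A sketch of such a class; the provisioner and `type: gp3` parameter assume the AWS EBS CSI driver, so swap them for your cloud:

```yaml
# Sketch: Retain-policy StorageClass for database volumes.
# Provisioner and parameters assume AWS EBS CSI — adjust for your cloud.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain                    # PV (and disk) survives PVC deletion
volumeBindingMode: WaitForFirstConsumer  # provision in the pod's zone
allowVolumeExpansion: true
```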

Volume Snapshots

```bash
# Create a snapshot before a risky operation
# (unquoted EOF so $(date +%Y%m%d) expands in the snapshot name)
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pg-snap-$(date +%Y%m%d)
  namespace: production
spec:
  volumeSnapshotClassName: csi-aws-snapclass
  source:
    persistentVolumeClaimName: data-postgres-0
EOF

# Check snapshot status
kubectl get volumesnapshot -n production
# NAME               READYTOUSE   RESTORESIZE   AGE
# pg-snap-20260315   true         100Gi         2m

# Restore from snapshot to a new PVC
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgres-restored
  namespace: production
spec:
  dataSource:
    name: pg-snap-20260315
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  storageClassName: fast-ssd
  resources:
    requests:
      storage: 100Gi
EOF
```

StatefulSet Storage Operations

```bash

# List PVCs for a StatefulSet (naming pattern: