Incident Replay: Persistent Volume Stuck Terminating¶
Setup¶
- System context: Kubernetes cluster with dynamically provisioned PVs (AWS EBS). A PVC was deleted but the PV is stuck in "Terminating" state for 2 hours. New PVCs for the same application cannot bind.
- Time: Thursday 16:30 UTC
- Your role: Platform engineer / on-call SRE
Round 1: Alert Fires¶
[Pressure cue: "Database team cannot provision a new PVC — their StatefulSet is stuck in Pending. They need storage for a new replica. Deadline: end of day."]
What you see:
kubectl get pv shows pv-data-0 in "Terminating" status for 2 hours. kubectl get pvc shows the new claim is Pending. The storage class has reclaimPolicy: Delete but the PV is not being deleted.
Choose your action:
- A) Force delete the PV: kubectl delete pv pv-data-0 --force --grace-period=0
- B) Check if the PV has finalizers preventing deletion
- C) Check the EBS volume status in AWS
- D) Recreate the storage class
If you chose B (recommended):¶
[Result:
kubectl get pv pv-data-0 -o jsonpath='{.metadata.finalizers}' shows ["kubernetes.io/pv-protection"]. The pv-protection finalizer is preventing deletion, but it should clear once the PVC is gone and nothing is still using the volume. Check what still references the PV. Proceed to Round 2.]
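A minimal inspection sketch for this step, assuming jq is installed; pv-data-0 is the stuck PV from this incident:

```shell
# 1. Show the finalizers on the PV; kubernetes.io/pv-protection is the blocker here.
kubectl get pv pv-data-0 -o json | jq -r '.metadata.finalizers[]?'

# 2. Check whether any VolumeAttachment still references the PV.
#    A non-empty result means a node is still considered attached.
kubectl get volumeattachments -o json \
  | jq -r '.items[]
           | select(.spec.source.persistentVolumeName == "pv-data-0")
           | "\(.metadata.name) -> \(.spec.nodeName)"'
```

The second command anticipates Round 2: if the finalizer is present but no PVC exists, a leftover VolumeAttachment is the usual suspect.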
If you chose A:¶
[Result: Force delete does not bypass the pv-protection finalizer; the object stays Terminating until the finalizer is stripped (e.g. kubectl patch pv pv-data-0 -p '{"metadata":{"finalizers":null}}'). Doing that removes the PV from Kubernetes but does NOT delete the EBS volume in AWS. The orphaned EBS volume continues to incur cost and cannot be reused.]
If you chose C:¶
[Result: The EBS volume is in "in-use" status, meaning something is still attached to it. A useful partial clue, but check the Kubernetes side first.]
If you chose D:¶
[Result: Storage class is fine. The issue is the stuck PV, not the provisioner.]
Round 2: First Triage Data¶
[Pressure cue: "PV stuck for 2+ hours. New database replica cannot start."]
What you see: The pv-protection finalizer prevents deletion while a PVC is bound, but the PVC was already deleted. Meanwhile, a VolumeAttachment object still exists, keeping the EBS volume attached to a node. That node ran a pod that was force-deleted without proper cleanup.
Choose your action:
- A) Delete the VolumeAttachment object
- B) Detach the EBS volume from the node in AWS
- C) Check which pod/node still has the volume attached
- D) Restart the CSI driver pods
If you chose C (recommended):¶
[Result:
kubectl get volumeattachments | grep pv-data-0 shows the volume is attached to k8s-worker-04. But no pod on worker-04 is using this volume. The attachment is orphaned: the pod was force-deleted and the detach never completed. Proceed to Round 3.]
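One way to confirm the attachment is orphaned, assuming jq is installed; k8s-worker-04 is the node from this incident:

```shell
# List attachments with their PV, target node, and attach status.
kubectl get volumeattachments \
  -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached

# List pods on the node that reference any PVC. If none of them claim
# our volume, nothing legitimately holds the attachment.
kubectl get pods --all-namespaces --field-selector spec.nodeName=k8s-worker-04 -o json \
  | jq -r '.items[]
           | select([.spec.volumes[]? | select(.persistentVolumeClaim != null)] | length > 0)
           | "\(.metadata.namespace)/\(.metadata.name)"'
```

Cross-checking both views (VolumeAttachment vs. actual pods on the node) is what distinguishes an orphaned attachment from a live one.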
If you chose A:¶
[Result: Deleting the VolumeAttachment tells the CSI driver to detach. This works, but it bypasses the normal cleanup path and may leave the EBS volume in a bad state if the detach fails. Diagnose before acting.]
If you chose B:¶
[Result: AWS-side detach works but Kubernetes does not know about it. The VolumeAttachment object and PV finalizer remain. State is inconsistent.]
If you chose D:¶
[Result: CSI driver restart may trigger a reconciliation loop that cleans up orphaned attachments. Possible fix but slow and affects all volumes.]
Round 3: Root Cause Identification¶
[Pressure cue: "Orphaned VolumeAttachment found. Clean it up."]
What you see:
Root cause: A pod using this PV was force-deleted (kubectl delete pod --force --grace-period=0) during an incident 3 hours ago. Force deletion skips the normal unmount/detach lifecycle. The VolumeAttachment was left behind, and the CSI driver's reconciliation could not clean it up because the node's kubelet had already forgotten about the volume.
Choose your action:
- A) Delete the VolumeAttachment and let the CSI driver handle EBS detach
- B) Manually detach in AWS then delete the VolumeAttachment in Kubernetes
- C) Delete the VolumeAttachment, remove the PV finalizer, then verify EBS cleanup
- D) Option A is safest — let the CSI driver orchestrate the detach
If you chose A (recommended):¶
[Result:
kubectl delete volumeattachment csi-xxxxxxx. The CSI driver detects the deletion and calls AWS to detach the EBS volume. EBS detaches successfully, the PV finalizer clears, and the PV is deleted. The new PVC can now bind to a fresh PV. Proceed to Round 4.]
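A sketch of the cleanup and verification chain, assuming jq is installed; csi-xxxxxxx stands in for the attachment name found in Round 2:

```shell
# Delete the orphaned attachment; the CSI driver issues the AWS detach.
kubectl delete volumeattachment csi-xxxxxxx

# Finalizers remaining on the PV; empty output means pv-protection cleared.
kubectl get pv pv-data-0 -o json | jq -r '.metadata.finalizers[]?'

# Poll until the PV object is removed entirely (reclaimPolicy: Delete).
until ! kubectl get pv pv-data-0 >/dev/null 2>&1; do sleep 5; done
echo "pv-data-0 deleted"
```

Verifying each link (attachment gone, finalizer cleared, PV removed) catches a stalled detach before it silently orphans the EBS volume.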
If you chose B:¶
[Result: Works but manual AWS operations bypass the CSI driver and can cause state inconsistencies.]
If you chose C:¶
[Result: Removing the finalizer before the VolumeAttachment is cleaned up may orphan the EBS volume in AWS.]
If you chose D:¶
[Result: Same as A. Correct approach.]
Round 4: Remediation¶
[Pressure cue: "PV deleted. New PVC binding. Verify."]
Actions:
1. Verify new PVC is Bound: kubectl get pvc
2. Verify StatefulSet replica starts: kubectl get pods
3. Verify EBS volume was properly deleted in AWS (if reclaim=Delete)
4. Add a runbook warning against --force --grace-period=0 for pods with PVs
5. Add monitoring for stuck PVs (Terminating > 30 minutes)
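A minimal sketch of the stuck-PV check in item 5, assuming jq and GNU date; a PV with a deletionTimestamp older than 30 minutes is still Terminating, so a cron job can alert on non-empty output:

```shell
# Flag PVs stuck in Terminating for more than 30 minutes.
# ISO 8601 UTC timestamps compare correctly as strings.
cutoff=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
kubectl get pv -o json \
  | jq -r --arg cutoff "$cutoff" '
      .items[]
      | select(.metadata.deletionTimestamp != null
               and .metadata.deletionTimestamp < $cutoff)
      | "\(.metadata.name) terminating since \(.metadata.deletionTimestamp)"'
```

In a cluster with kube-state-metrics, the same condition can be expressed as an alert on `kube_persistentvolume_deletion_timestamp`, but the one-shot script above needs no extra infrastructure.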
Damage Report¶
- Total downtime: 0 (existing database replicas served traffic)
- Blast radius: New database replica delayed 2.5 hours; scaling blocked
- Optimal resolution time: 10 minutes (check finalizers -> find orphaned VolumeAttachment -> delete)
- If every wrong choice was made: 4+ hours with orphaned EBS volumes and inconsistent state
Cross-References¶
- Primer: Kubernetes Storage
- Primer: Kubernetes Ops
- Footguns: Kubernetes Storage