Incident Replay: Persistent Volume Stuck Terminating

Setup

  • System context: Kubernetes cluster with dynamically provisioned PVs (AWS EBS). A PVC was deleted but the PV is stuck in "Terminating" state for 2 hours. New PVCs for the same application cannot bind.
  • Time: Thursday 16:30 UTC
  • Your role: Platform engineer / on-call SRE

Round 1: Alert Fires

[Pressure cue: "Database team cannot provision a new PVC — their StatefulSet is stuck in Pending. They need storage for a new replica. Deadline: end of day."]

What you see: kubectl get pv shows pv-data-0 in "Terminating" status for 2 hours. kubectl get pvc shows the new claim is Pending. The storage class has reclaimPolicy: Delete but the PV is not being deleted.

Choose your action:

  • A) Force delete the PV: kubectl delete pv pv-data-0 --force --grace-period=0
  • B) Check whether the PV has finalizers preventing deletion
  • C) Check the EBS volume status in AWS
  • D) Recreate the storage class

[Result: kubectl get pv pv-data-0 -o jsonpath='{.metadata.finalizers}' shows ["kubernetes.io/pv-protection"]. The pv-protection finalizer is blocking deletion. It normally clears once nothing is using the PV, so check what is still holding the volume. Proceed to Round 2.]

If you chose A:

[Result: Force delete alone hangs on the finalizer; if you then strip the finalizer, the PV object disappears from Kubernetes, but the EBS volume in AWS is NOT deleted. The orphaned EBS volume continues to incur cost and cannot be reused.]

If you chose C:

[Result: The EBS volume is in "in-use" status, meaning something is still attached to it. A partial clue, but you need to check the Kubernetes side first.]

If you chose D:

[Result: Storage class is fine. The issue is the stuck PV, not the provisioner.]
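The Round 1 triage above can be run as a short command sequence. A minimal sketch, assuming kubectl access to the cluster and the PV name from this scenario:

```shell
# Inspect the stuck PV's finalizers (pv-protection is the expected culprit):
kubectl get pv pv-data-0 -o jsonpath='{.metadata.finalizers}'

# Confirm the PV really is Terminating (deletionTimestamp is set):
kubectl get pv pv-data-0 -o jsonpath='{.metadata.deletionTimestamp}'

# Check whether any PVC still references the PV via its claimRef:
kubectl get pv pv-data-0 -o jsonpath='{.spec.claimRef}'
```

If the claimRef points at a PVC that no longer exists, something lower in the stack (an attachment, not a claim) is holding the volume.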

Round 2: First Triage Data

[Pressure cue: "PV stuck for 2+ hours. New database replica cannot start."]

What you see: The PV protection finalizer prevents deletion while a PVC is bound, but the PVC was already deleted. However, a VolumeAttachment object still exists, keeping the EBS volume attached to a node. That node previously ran a pod that was force-deleted without proper cleanup.

Choose your action:

  • A) Delete the VolumeAttachment object
  • B) Detach the EBS volume from the node in AWS
  • C) Check which pod/node still has the volume attached
  • D) Restart the CSI driver pods

[Result: kubectl get volumeattachments | grep pv-data-0 shows the volume is attached to k8s-worker-04. But no pod on worker-04 is using this volume. The attachment is orphaned — the pod was force-deleted and the detach never completed. Proceed to Round 3.]

If you chose A:

[Result: Deleting the VolumeAttachment tells the CSI driver to detach the volume. This works, but it bypasses the normal cleanup path and may leave the EBS volume in a bad state if the detach fails.]

If you chose B:

[Result: AWS-side detach works but Kubernetes does not know about it. The VolumeAttachment object and PV finalizer remain. State is inconsistent.]

If you chose D:

[Result: CSI driver restart may trigger a reconciliation loop that cleans up orphaned attachments. Possible fix but slow and affects all volumes.]
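The attachment trace from Round 2 can be reproduced with the commands below. A sketch, assuming kubectl access; the PV and node names come from this scenario:

```shell
# Find VolumeAttachments that reference the stuck PV:
kubectl get volumeattachments -o wide | grep pv-data-0

# List pods scheduled on the attached node; none should be using the volume:
kubectl get pods --all-namespaces --field-selector spec.nodeName=k8s-worker-04 -o name

# Compare with what the node itself reports as in-use volumes:
kubectl get node k8s-worker-04 -o jsonpath='{.status.volumesInUse}'
```

An attachment with no corresponding pod and no entry in the node's in-use list is the orphan signature described above.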

Round 3: Root Cause Identification

[Pressure cue: "Orphaned VolumeAttachment found. Clean it up."]

What you see: Root cause: A pod using this PV was force-deleted (kubectl delete pod --force --grace-period=0) during an incident 3 hours ago. Force deletion skips the normal unmount/detach lifecycle. The VolumeAttachment was left behind, and the CSI driver's reconciliation could not clean it up because the node's kubelet had already forgotten about the volume.

Choose your action:

  • A) Delete the VolumeAttachment and let the CSI driver handle EBS detach
  • B) Manually detach in AWS, then delete the VolumeAttachment in Kubernetes
  • C) Delete the VolumeAttachment, remove the PV finalizer, then verify EBS cleanup
  • D) Option A is safest: let the CSI driver orchestrate the detach

[Result: kubectl delete volumeattachment csi-xxxxxxx. CSI driver detects the deletion, calls AWS to detach the EBS volume. EBS detaches successfully. PV finalizer clears. PV is deleted. New PVC can now bind to a fresh PV. Proceed to Round 4.]

If you chose B:

[Result: Works but manual AWS operations bypass the CSI driver and can cause state inconsistencies.]

If you chose C:

[Result: Removing the finalizer before the VolumeAttachment is cleaned up may orphan the EBS volume in AWS.]

If you chose D:

[Result: Same as A. Correct approach.]
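The cleanup chosen in Round 3 can be scripted defensively. A sketch, assuming kubectl access; rather than guessing the csi-xxxxxxx name, look it up by the PV it sources:

```shell
# Resolve the VolumeAttachment name for the stuck PV (do not guess it):
VA=$(kubectl get volumeattachments \
      -o jsonpath='{.items[?(@.spec.source.persistentVolumeName=="pv-data-0")].metadata.name}')

# Deleting the VolumeAttachment asks the CSI driver to detach the EBS volume:
kubectl delete volumeattachment "$VA"

# Watch the detach complete and the finalizer clear:
kubectl get volumeattachments
kubectl get pv pv-data-0   # should eventually return NotFound
```

Letting the CSI driver drive the detach keeps Kubernetes and AWS state consistent, which is exactly what the manual-AWS path in option B risks breaking.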

Round 4: Remediation

[Pressure cue: "PV deleted. New PVC binding. Verify."]

Actions:

  1. Verify the new PVC is Bound: kubectl get pvc
  2. Verify the StatefulSet replica starts: kubectl get pods
  3. Verify the EBS volume was properly deleted in AWS (if reclaimPolicy: Delete)
  4. Add a runbook warning against --force --grace-period=0 for pods with attached PVs
  5. Add monitoring for stuck PVs (Terminating > 30 minutes)
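The monitoring item above could be sketched as a small check script. A minimal sketch, assuming kubectl and GNU date are available; a PV counts as "Terminating" when its deletionTimestamp is set but the object still exists, and the 30-minute threshold matches the runbook:

```shell
#!/bin/sh
THRESHOLD_MIN=30

# Minutes elapsed since an RFC3339 timestamp such as a deletionTimestamp.
minutes_since() {
  then_epoch=$(date -u -d "$1" +%s)
  now_epoch=$(date -u +%s)
  echo $(( (now_epoch - then_epoch) / 60 ))
}

# Emit an alert line for every PV stuck in Terminating past the threshold.
check_stuck_pvs() {
  kubectl get pv -o jsonpath='{range .items[?(@.metadata.deletionTimestamp)]}{.metadata.name} {.metadata.deletionTimestamp}{"\n"}{end}' |
  while read -r name ts; do
    age=$(minutes_since "$ts")
    if [ "$age" -gt "$THRESHOLD_MIN" ]; then
      echo "ALERT: PV $name Terminating for ${age}m"
    fi
  done
}

# Only hit the cluster when invoked with --run, so the functions can be sourced.
if [ "${1:-}" = "--run" ]; then
  check_stuck_pvs
fi
```

Run on a schedule (cron or a Kubernetes CronJob), this would have flagged pv-data-0 roughly 90 minutes before the database team opened their ticket.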

Damage Report

  • Total downtime: 0 (existing database replicas served traffic)
  • Blast radius: New database replica delayed 2.5 hours; scaling blocked
  • Optimal resolution time: 10 minutes (check finalizers -> find orphaned VolumeAttachment -> delete)
  • If every wrong choice was made: 4+ hours with orphaned EBS volumes and inconsistent state

Cross-References