Incident Replay: Stuck NFS Mount
Setup
- System context: Application server with an NFS mount (/mnt/shared) used for shared configuration files. The NFS server became unreachable and now processes accessing the mount are hanging.
- Time: Thursday 22:30 UTC
- Your role: On-call SRE
Round 1: Alert Fires
[Pressure cue: "Multiple processes on app-prod-04 are hung. SSH sessions freeze when running 'ls' or 'df'. Server appears to be partially alive."]
What you see:
SSH login succeeds but any command that touches the filesystem hangs. ps aux shows multiple processes in D (uninterruptible sleep) state. The NFS mount at /mnt/shared is the culprit — the NFS server is unreachable.
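The D-state check described above can be scripted. The following is a minimal sketch, assuming the input is the text printed by ps -eo pid,stat,comm; the function name find_d_state and the sample PIDs are illustrative and do not come from the incident.

```python
def find_d_state(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for every process whose STAT starts with 'D'."""
    hung = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        pid, stat, command = fields
        if stat.startswith("D"):  # D = uninterruptible sleep
            hung.append((int(pid), command.strip()))
    return hung


# Illustrative sample only; real input would come from `ps -eo pid,stat,comm`.
SAMPLE_PS = """\
  PID STAT COMMAND
    1 Ss   systemd
 4242 D    ls
 4250 D+   df
 4300 S    sshd
"""

print(find_d_state(SAMPLE_PS))
```

Parsing captured text rather than walking /proc directly keeps the helper safe to run on a half-frozen host, since it never touches the hung mount itself.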
Choose your action:
- A) Reboot the server
- B) Try to unmount the NFS share: umount /mnt/shared
- C) Try lazy unmount: umount -l /mnt/shared
- D) Check why the NFS server is down
If you chose C (recommended):
[Result: umount -l /mnt/shared detaches the mount from the filesystem namespace immediately. Processes that were hung in D state begin to return errors instead of hanging. The system becomes responsive. Proceed to Round 2.]
If you chose A:
[Result: Reboot may hang during shutdown waiting for NFS operations to complete or time out. Could take 10+ minutes or hang indefinitely.]
If you chose B:
[Result: umount hangs because processes have open file handles on the mount. It waits for all references to close, which requires the NFS server. Circular dependency.]
If you chose D:
[Result: You cannot effectively troubleshoot from a half-frozen server. Fix the local system first, then investigate the NFS server.]
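Round 1 shows that any command touching a dead mount can freeze the shell that runs it. One way to check a mount without that risk is to make the filesystem call in a child process and give up after a timeout. This is a sketch under assumptions: the helper name mount_responds and the 5-second default are hypothetical, and a child stuck in D state may linger until the server returns or the kernel delivers the kill.

```python
import subprocess
import sys


def mount_responds(path: str, timeout_s: float = 5.0) -> bool:
    """Return True if statvfs() on path completes within timeout_s seconds."""
    # Run the filesystem call in a child so a hang cannot block this process.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.statvfs(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means statvfs() returned; non-zero means it raised.
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Send SIGKILL but do not wait for the child: it may be stuck in D state.
        child.kill()
        return False
```

Calling mount_responds("/mnt/shared") before running ls or df against the share gives a yes/no answer instead of another hung session.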
Round 2: First Triage Data
[Pressure cue: "Server is responsive again. Applications are erroring on missing config files from /mnt/shared. NFS server is still down."]
What you see: The NFS server (nfs-prod-01) is unreachable — ping times out. The NFS server had a kernel panic and is being recovered by the datacenter team. ETA: 30 minutes.
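Ping only shows whether the host answers ICMP. A complementary check is whether the NFS service port accepts TCP connections. The sketch below assumes the server listens on the standard NFS port 2049; the function name nfs_port_open and the 3-second timeout are illustrative choices.

```python
import socket


def nfs_port_open(host: str, port: int = 2049, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Polling nfs_port_open("nfs-prod-01") during the recovery window would show when the service itself is back, not just when the host responds.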
Choose your action:
- A) Wait for the NFS server to come back and remount
- B) Copy essential config files from a backup to a local directory
- C) Point the application to a secondary NFS server (if one exists)
- D) Cache the NFS-hosted config files locally and serve from cache
If you chose B (recommended):
[Result: Essential config files (3 files, 50KB total) are available in the application's git repo. Copy them to /mnt/shared-local/ and update the application's config path. Application starts with local configs. Proceed to Round 3.]
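The copy step can be sketched as a small helper. Everything named here is an assumption for illustration: the function stage_local_configs, the repository checkout path, and the file names are not given in the incident. The sketch refuses to copy anything if a file is missing, so the application is never started on a partial config set.

```python
import shutil
from pathlib import Path


def stage_local_configs(repo_dir: str, local_dir: str, names: list[str]) -> list[Path]:
    """Copy the named config files from repo_dir into local_dir."""
    src, dst = Path(repo_dir), Path(local_dir)
    missing = [n for n in names if not (src / n).is_file()]
    if missing:
        # Fail before copying anything so the app never starts on a partial set.
        raise FileNotFoundError(f"missing in {src}: {missing}")
    dst.mkdir(parents=True, exist_ok=True)
    # copy2 preserves timestamps and permissions along with the contents.
    return [Path(shutil.copy2(src / n, dst / n)) for n in names]
```

A call might look like stage_local_configs("/path/to/repo/config", "/mnt/shared-local", ["app.conf"]), with the source path and file list replaced by the real ones.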
If you chose A:
[Result: 30 minutes of application downtime waiting for NFS recovery. Unacceptable for production.]
If you chose C:
[Result: No secondary NFS server exists. This is a single point of failure.]
If you chose D:
[Result: Good idea but requires application code changes for caching logic. Not an incident-time fix.]
Round 3: Root Cause Identification
[Pressure cue: "Application running on local configs. NFS server recovered. Remount and document."]
What you see:
Root cause: Hard NFS mount (the default) caused processes to hang indefinitely when the NFS server went down. The mount options did not include soft, timeo, or retrans parameters for timeout handling.
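To find other mounts with the same exposure, the mount table can be scanned for NFS entries that lack the soft option. This is a minimal sketch, assuming the input is the text of /proc/mounts; the function name hard_nfs_mounts, the export path, and the second mount point in the sample are hypothetical. It only looks for the literal option soft, so any other variant would be reported as hard.

```python
def hard_nfs_mounts(mounts_text: str) -> list[str]:
    """Return mount points of NFS mounts that lack the 'soft' option."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type not in ("nfs", "nfs4"):
            continue
        # Hard is the default, so it may not be listed; test for 'soft' instead.
        if "soft" not in options.split(","):
            flagged.append(mount_point)
    return flagged


# Illustrative sample only; real input would be the contents of /proc/mounts.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
nfs-prod-01:/export /mnt/shared nfs4 rw,relatime,hard,proto=tcp,timeo=600,retrans=2 0 0
nfs-prod-01:/export /mnt/other nfs rw,soft,timeo=30,retrans=3 0 0
"""

print(hard_nfs_mounts(SAMPLE_MOUNTS))
```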
Choose your action:
- A) Remount with soft,timeo=30,retrans=3 options
- B) Remount with hard,intr options to allow interrupt of hung operations
- C) Switch to autofs for on-demand mounting with timeout
- D) Add soft mount options and set up a secondary NFS server for HA
If you chose D (recommended):
[Result: NFS remounted with soft,timeo=30,retrans=3 to prevent infinite hangs. Secondary NFS server planned for HA. Application also configured to fall back to local config cache. Proceed to Round 4.]
If you chose A:
[Result: Soft mount prevents hangs but returns errors to applications. Need to ensure the application handles NFS errors gracefully.]
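If you chose B:
[Result: hard,intr historically allowed interrupting a hung operation with Ctrl+C, but automated processes still hang. On kernels after 2.6.25 the intr option is ignored and only SIGKILL can interrupt a pending NFS operation. Not fully automated resilience.]
If you chose C: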
[Result: autofs is good for home directories but adds complexity for application mounts.]
Round 4: Remediation
[Pressure cue: "NFS remounted. Application healthy. Close."]
Actions:
1. Verify NFS mount is active: mount | grep nfs and df -h /mnt/shared
2. Verify application is using NFS-hosted configs again
3. Update /etc/fstab with soft mount options
4. Add NFS server health monitoring
5. Plan secondary NFS server deployment for high availability
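Step 3 can be sketched as a transformation on a single /etc/fstab line. This is an illustration under assumptions, not a tested migration tool: the function name soften_fstab_line and the export path in the usage note are hypothetical, and the sketch normalizes whitespace between fields. Note that timeo is expressed in tenths of a second, so timeo=30 means 3 seconds.

```python
SOFT_OPTS = ["soft", "timeo=30", "retrans=3"]  # timeo is in tenths of a second


def soften_fstab_line(line: str) -> str:
    """Return the fstab line with soft-mount options applied to NFS entries."""
    fields = line.split()
    if line.lstrip().startswith("#") or len(fields) < 4:
        return line  # comments and malformed lines pass through unchanged
    if fields[2] not in ("nfs", "nfs4"):
        return line  # only NFS entries are rewritten
    replaced = ("hard", "soft", "intr", "nointr")
    kept = [
        opt
        for opt in fields[3].split(",")
        if opt not in replaced and not opt.startswith(("timeo=", "retrans="))
    ]
    fields[3] = ",".join(kept + SOFT_OPTS)
    return " ".join(fields)
```

For example, an entry such as "nfs-prod-01:/export /mnt/shared nfs defaults,hard,intr 0 0" would come back with the options field set to defaults,soft,timeo=30,retrans=3. Review the result by hand before writing it to /etc/fstab.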
Damage Report
- Total downtime: 15 minutes (processes hung until lazy unmount + config workaround)
- Blast radius: All applications dependent on NFS shared configs on this server
- Optimal resolution time: 5 minutes (lazy unmount -> local config fallback)
- If every wrong choice was made: 60+ minutes of frozen server plus 30-minute NFS recovery wait
Cross-References
- Primer: Mounts & Filesystems
- Primer: Linux Ops
- Footguns: Linux Ops