Incident Replay: Inode Exhaustion¶

Setup¶

System context: Production web server running a PHP application with file-based session storage. df -h shows 60% disk usage (40% free), yet no new files can be created.
Time: Thursday 04:15 UTC
Your role: On-call SRE

Round 1: Alert Fires¶

[Pressure cue: "Monitoring fires: web application returning 500 errors. 'Cannot create temp file.' Disk shows 40% free. Auto-escalation in 5 minutes."]

What you see: SSH works but touch /tmp/test fails with "No space left on device." df -h shows / at 60% usage — plenty of space. Application logs show "failed to open session file."

Choose your action:
- A) Check disk quota settings for the application user
- B) Check inode usage with df -i
- C) Expand the filesystem to add more space
- D) Restart the web server to clear temp files

If you chose B (recommended):¶

[Result: df -i / shows 100% inode usage — 0 free inodes. The filesystem has space but no more files can be created because every inode is allocated. Proceed to Round 2.]
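The mismatch between block usage and inode usage can be put side by side in one check. A minimal sketch, assuming a df that accepts -P together with -i (GNU coreutils does); the helper name use_pct is illustrative and not part of any tool:

```shell
# Extract the use% column (5th field) from the single data line of
# POSIX-format df output read on stdin.
use_pct() {
  awk 'NR == 2 { print $5 }'
}

# Compare block usage and inode usage for one mount point.
path="/"
echo "blocks used: $(df -P  "$path" | use_pct)"
echo "inodes used: $(df -Pi "$path" | use_pct)"
```

In this scenario the first line would read about 60% while the second reads 100%, which is the signature of inode exhaustion.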

If you chose A:¶

[Result: No disk quotas are set. The issue is not quota-related.]

If you chose C:¶

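[Result: On ext4 the bytes-per-inode ratio is fixed at mkfs time, so growing the filesystem adds inodes only in proportion to the new space, and whatever is creating the files would consume those too. It costs time you do not have and leaves the cause in place. Wasted effort.]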
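On ext4 the inode count follows from the filesystem size and the bytes-per-inode ratio chosen at mkfs time. A rough sketch of that arithmetic, assuming a ratio of 16384 bytes per inode (a common mke2fs default; the real value on a host may differ) and a hypothetical helper name estimated_inodes:

```shell
# Estimate how many inodes an ext4 filesystem of a given size receives.
# Arguments: size in bytes, bytes-per-inode ratio (assumed default: 16384).
# This is an approximation; mke2fs rounds per block group.
estimated_inodes() {
  size_bytes="$1"
  ratio="${2:-16384}"
  echo $(( size_bytes / ratio ))
}

# Example: a 100 GiB filesystem at the assumed default ratio.
estimated_inodes $(( 100 * 1024 * 1024 * 1024 ))   # prints 6553600
```

The actual figure for an existing filesystem appears in the "Inode count" field of tune2fs -l for the block device, and a different ratio can be requested at creation time with mkfs.ext4 -i.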

If you chose D:¶

[Result: Web server restart fails — it cannot create its PID file. You have made things worse.]

Round 2: First Triage Data¶

[Pressure cue: "Web application is down. Users cannot log in (sessions cannot be created). Revenue impact."]

What you see: Need to find where the inodes are consumed. Millions of tiny files exist somewhere on the filesystem.

Choose your action:
- A) Run find / -xdev -type f | wc -l to count total files
- B) Check common locations for small file accumulation: /tmp, /var/spool, session dirs
- C) Run for d in /*; do echo "$d: $(find "$d" -xdev -type f | wc -l)"; done
- D) Delete all files older than 7 days in /tmp

If you chose B (recommended):¶

[Result: /var/lib/php/sessions/ contains 3.2 million tiny session files. The session garbage collection (gc) is disabled — session.gc_probability = 0 in php.ini. Sessions have been accumulating for months. Proceed to Round 3.]
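The GC settings can be read straight from the ini file. A minimal sketch; the function name session_settings is illustrative, and the php.ini path in the usage note is an assumption that varies by distribution and PHP version:

```shell
# Print the active (uncommented) session.gc_* and session.save_path lines
# from a php.ini-style file.
session_settings() {
  grep -E '^[[:space:]]*session\.(gc_[a-z_]+|save_path)[[:space:]]*=' "$1"
}
```

For example, session_settings /etc/php/8.2/fpm/php.ini (hypothetical path) would show session.gc_probability = 0 on this host. Settings can also be overridden in conf.d fragments or FPM pool files, so treat the main php.ini as the first place to look, not the only one.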

If you chose A:¶

[Result: Gives you the total count (3.4 million) but not the location, and the traversal takes 10+ minutes.]

If you chose C:¶

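[Result: Good approach. The loop shows /var holds 3.2M files, which narrows the search to one top-level directory, but it takes 5 minutes and still needs a second pass to reach /var/lib/php/sessions.]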
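A variant of the per-directory count reports the individual directories holding the most files, rather than top-level totals. This is a sketch that assumes GNU find (for -printf); the function name top_dirs_by_files is illustrative:

```shell
# List the directories under a root that contain the most regular files.
# -xdev keeps find on one filesystem; -printf '%h\n' (GNU find) prints the
# parent directory of each file, which uniq -c then tallies.
top_dirs_by_files() {
  root="$1"
  limit="${2:-10}"
  find "$root" -xdev -type f -printf '%h\n' 2>/dev/null \
    | sort | uniq -c | sort -rn | head -n "$limit"
}
```

Running top_dirs_by_files / 5 as root would be expected to put the session directory at the top of the list here. It still walks the whole filesystem, so it is no faster than option C, only more precise.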

If you chose D:¶

[Result: /tmp only has 2000 files. The sessions are in a different directory. No improvement.]

Round 3: Root Cause Identification¶

[Pressure cue: "Found 3.2M session files. Clean them up."]

What you see: Root cause: PHP session garbage collection was disabled (session.gc_probability = 0) for performance reasons. Sessions are created but never cleaned up. Over 8 months, 3.2 million session files consumed all available inodes.

Choose your action:
- A) Delete all session files: find /var/lib/php/sessions -type f -delete
- B) Delete sessions older than 24 hours, then enable GC
- C) Enable session GC and wait for it to clean up
- D) Switch to Redis for session storage

If you chose B (recommended):¶

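[Result: find /var/lib/php/sessions -type f -mmin +1440 -delete removes 3.1M files in about 3 minutes. Use -mmin +1440 rather than -mtime +1, which matches only files at least 48 hours old because find discards the fractional day. Inode usage drops to 5%. Then set session.gc_probability = 1 and session.gc_maxlifetime = 1440 (seconds) in php.ini and restart PHP-FPM. Proceed to Round 4.]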
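A bulk delete during an incident is safer with a count first. The sketch below assumes GNU find, a flat session directory, and PHP's default sess_ file prefix; the function name purge_old_sessions and its dry-run argument are illustrative:

```shell
# Remove PHP session files (sess_*) in one directory that were last modified
# more than N minutes ago. With "dry-run" as the third argument, only count
# the files that would be removed.
purge_old_sessions() {
  dir="$1"
  minutes="$2"
  mode="${3:-delete}"
  if [ "$mode" = "dry-run" ]; then
    find "$dir" -maxdepth 1 -type f -name 'sess_*' -mmin +"$minutes" | wc -l
  else
    find "$dir" -maxdepth 1 -type f -name 'sess_*' -mmin +"$minutes" -delete
  fi
}
```

A typical sequence would be purge_old_sessions /var/lib/php/sessions 1440 dry-run, then the same call without dry-run. Letting find do the deletion avoids the "argument list too long" failure that rm with a glob hits on millions of files.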

If you chose A:¶

[Result: Deleting ALL sessions logs out every active user, including those currently in checkout. Business impact.]

If you chose C:¶

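[Result: GC is probabilistic: it runs on only a fraction of session starts (gc_probability / gc_divisor), and the run that fires must scan all 3.2M files inside a user request. Recovery time is unpredictable and the application stays down while you wait.]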

If you chose D:¶

[Result: Redis is a good long-term solution but requires application config changes, testing, and deployment. Not an incident-time fix.]

Round 4: Remediation¶

[Pressure cue: "Sessions cleaned. Application recovering."]

Actions:
1. Verify inode usage is healthy: df -i /
2. Verify application can create new sessions
3. Verify PHP session GC is enabled: check phpinfo()
4. Add inode usage monitoring with alert at 80%
5. Plan migration from file-based sessions to Redis
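The 80% inode alert in step 4 could start as a small check run from cron or a monitoring agent. A minimal sketch, assuming df supports -P with -i and that mount points contain no spaces; the function name inode_alerts is illustrative:

```shell
# Print "mount use%" for every filesystem whose inode usage is at or above
# a threshold (default 80). Reads `df -Pi` output on stdin and skips
# filesystems that report "-" because they do not track inodes.
inode_alerts() {
  threshold="${1:-80}"
  awk -v t="$threshold" '
    NR > 1 && $5 != "-" {
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= t) print $6, $5
    }'
}

df -Pi | inode_alerts 80
```

Empty output means every filesystem is below the threshold. How a non-empty result becomes a page depends on the monitoring stack in use.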

Damage Report¶

Total downtime: 20 minutes (application 500 errors)
Blast radius: All users unable to log in or maintain sessions
Optimal resolution time: 8 minutes (check df -i -> find sessions -> bulk delete old -> enable GC)
If every wrong choice was made: 60+ minutes plus user session loss from deleting active sessions

Cross-References¶

Primer: Linux Ops
Primer: Inodes
Footguns: Linux Ops
