Incident Replay: Inode Exhaustion
Setup
- System context: Production web server running a PHP application with file-based session storage. df -h shows 60% disk usage, but no files can be created.
- Time: Thursday 04:15 UTC
- Your role: On-call SRE
Round 1: Alert Fires
[Pressure cue: "Monitoring fires — web application returning 500 errors. 'Cannot create temp file.' Disk shows 40% free. Auto-escalation in 5 minutes."]
What you see:
SSH works but touch /tmp/test fails with "No space left on device." df -h shows / at 60% usage — plenty of space. Application logs show "failed to open session file."
Choose your action:
- A) Check disk quota settings for the application user
- B) Check inode usage with df -i
- C) Expand the filesystem to add more space
- D) Restart the web server to clear temp files
If you chose B (recommended):
[Result: df -i / shows 100% inode usage, with 0 free inodes. The filesystem has free space, but no new files can be created because every inode is allocated. Proceed to Round 2.]
If you chose A:
[Result: No disk quotas are set. The issue is not quota-related.]
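If you chose C:
[Result: On ext4 the inode density is fixed at mkfs time, so growing the filesystem adds inodes only in proportion to the new space and does nothing about the files already consuming them. Wasted effort during the incident.]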
If you chose D:
[Result: Web server restart fails — it cannot create its PID file. You have made things worse.]
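The check in option B can be sketched as a small shell helper that prints block usage and inode usage side by side, which makes the "space free but cannot create files" pattern visible at a glance. The function names use_pct and diagnose_enospc are illustrative and not part of any tool. The sketch assumes a df that accepts the POSIX -P output format together with -i, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative only.

# Read `df -P` style output on stdin and print the 5th column
# (Use% or IUse%) of the first data row.
use_pct() {
  awk 'NR == 2 { print $5 }'
}

# Print block usage and inode usage for one mount point (default: /).
diagnose_enospc() {
  local mount="${1:-/}"
  printf 'blocks used: %s\n' "$(df -P "$mount" | use_pct)"
  printf 'inodes used: %s\n' "$(df -Pi "$mount" | use_pct)"
}
```

On the server in this scenario, diagnose_enospc / would be expected to report roughly 60% for blocks and 100% for inodes.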
Round 2: First Triage Data
[Pressure cue: "Web application is down. Users cannot log in (sessions cannot be created). Revenue impact."]
What you see:
You need to find where the inodes are being consumed. Millions of tiny files must exist somewhere on the filesystem.
Choose your action:
- A) Run find / -xdev -type f | wc -l to count total files
- B) Check common locations for small file accumulation: /tmp, /var/spool, session dirs
- C) Run for d in /*; do echo "$d: $(find "$d" -xdev -type f | wc -l)"; done
- D) Delete all files older than 7 days in /tmp
If you chose B (recommended):
[Result: /var/lib/php/sessions/ contains 3.2 million tiny session files. Session garbage collection (GC) is disabled: session.gc_probability = 0 in php.ini. Sessions have been accumulating for months. Proceed to Round 3.]
If you chose A:
[Result: Gives you the total count (3.4 million) but not the location. Slow — takes 10+ minutes to traverse.]
If you chose C:
[Result: Good approach: shows /var/lib has 3.2M files. Narrows it down but takes 5 minutes.]
If you chose D:
[Result: /tmp only has 2000 files. The sessions are in a different directory. No improvement.]
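The per-directory count from option C can be wrapped in a reusable helper that sorts the results, so the heaviest directory appears first. This is a minimal sketch: the name top_file_counts is hypothetical, and it counts regular files only, as the options above do.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative only.

# List the immediate subdirectories of a root with the most regular files.
#   $1: directory to inspect (default: /)
#   $2: number of rows to print (default: 10)
top_file_counts() {
  local root="${1:-/}" limit="${2:-10}" d
  root="${root%/}"              # avoid a double slash when root is /
  for d in "$root"/*/; do
    [ -d "$d" ] || continue     # skip the literal pattern if nothing matched
    # -xdev keeps find on one filesystem, as in option C
    printf '%s\t%s\n' "$(find "$d" -xdev -type f | wc -l | tr -d ' ')" "${d%/}"
  done | sort -rn | head -n "$limit"
}
```

Drilling down means re-running it on the largest entry, for example top_file_counts /var/lib, until the offending directory is found. Hidden directories are not matched by the glob, and each pass is still slow on a filesystem holding millions of files. On systems with GNU coreutils, du --inodes reports similar per-directory counts.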
Round 3: Root Cause Identification
[Pressure cue: "Found 3.2M session files. Clean them up."]
What you see:
Root cause: PHP session garbage collection was disabled (session.gc_probability = 0) for performance reasons. Sessions are created but never cleaned up. Over 8 months, 3.2 million session files consumed all available inodes.
Choose your action:
- A) Delete all session files: find /var/lib/php/sessions -type f -delete
- B) Delete sessions older than 24 hours, then enable GC
- C) Enable session GC and wait for it to clean up
- D) Switch to Redis for session storage
If you chose B (recommended):
[Result: find /var/lib/php/sessions -type f -mmin +1440 -delete removes 3.1M files (with -mtime +1, find would only match files at least two days old, because it rounds file age down to whole days). Takes 3 minutes. Inodes drop to 5% used. Then set session.gc_probability = 1 and session.gc_maxlifetime = 1440 (seconds) in php.ini. Restart PHP-FPM. Proceed to Round 4.]
If you chose A:
[Result: Deleting ALL sessions logs out every active user. Including those currently in checkout. Business impact.]
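If you chose C:
[Result: GC only runs on a fraction of session starts (gc_probability / gc_divisor), and each run has to scan the whole directory of 3.2M files inside the request that triggers it. Cleanup is slow and unpredictable, and the application stays down in the meantime.]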
If you chose D:
[Result: Redis is a good long-term solution but requires application config changes, testing, and deployment. Not an incident-time fix.]
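The age-based cleanup from option B can be sketched as a count step followed by a delete step, so the number of files about to be removed is known before anything is deleted. The function names are hypothetical. The sketch assumes PHP's default files handler, which names session files sess_<id>, a flat session directory, and a find that supports -maxdepth, -mmin and -delete (GNU findutils does). It uses -mmin +1440 for "older than 24 hours" because -mtime +1 only matches files at least two days old.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative only.
# Assumes session files are named sess_<id> and sit directly in the
# session directory (no nested save_path levels).

# Count session files last modified more than $2 minutes ago (default 1440).
count_old_sessions() {
  local dir="$1" max_age_min="${2:-1440}"
  find "$dir" -maxdepth 1 -type f -name 'sess_*' -mmin +"$max_age_min" \
    | wc -l | tr -d ' '
}

# Delete those same files. Run count_old_sessions first as a dry run.
purge_old_sessions() {
  local dir="$1" max_age_min="${2:-1440}"
  find "$dir" -maxdepth 1 -type f -name 'sess_*' -mmin +"$max_age_min" -delete
}
```

A typical sequence would be count_old_sessions /var/lib/php/sessions, then purge_old_sessions /var/lib/php/sessions once the count looks plausible. Restricting the match to sess_* is a safety choice, so unrelated files in the directory are left alone.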
Round 4: Remediation
[Pressure cue: "Sessions cleaned. Application recovering."]
Actions:
1. Verify inode usage is healthy: df -i /
2. Verify application can create new sessions
3. Verify PHP session GC is enabled: check phpinfo()
4. Add inode usage monitoring with alert at 80%
5. Plan migration from file-based sessions to Redis
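The 80% inode alert from step 4 could be prototyped as a filter over df output before it is wired into a real monitoring system. This is a sketch under assumptions: the name inodes_over_threshold is hypothetical, and the input is expected in the df -Pi column layout (mount point in the sixth column).

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name is illustrative only.

# Read `df -Pi` output on stdin and print every mount point whose inode
# usage is at or above the threshold in $1 (default: 80 percent).
# Rows without a numeric IUse% (shown as "-" on some filesystems) are skipped.
inodes_over_threshold() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    NR > 1 && $5 ~ /^[0-9]+%$/ {
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= t + 0) print $6, pct "%"
    }'
}
```

Usage would be df -Pi | inodes_over_threshold 80, for example from a cron job that alerts when the output is non-empty. Mount points containing spaces would be cut off at the first space, so a production check should parse the output more carefully.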
Damage Report
- Total downtime: 20 minutes (application 500 errors)
- Blast radius: All users unable to log in or maintain sessions
- Optimal resolution time: 8 minutes (check df -i -> find sessions -> bulk delete old -> enable GC)
- If every wrong choice was made: 60+ minutes plus user session loss from deleting active sessions