Incident Replay: Inode Exhaustion

Setup

  • System context: Production web server running a PHP application with file-based session storage. df -h shows 60% disk usage, but no files can be created.
  • Time: Thursday 04:15 UTC
  • Your role: On-call SRE

Round 1: Alert Fires

[Pressure cue: "Monitoring fires — web application returning 500 errors. 'Cannot create temp file.' Disk shows 40% free. Auto-escalation in 5 minutes."]

What you see: SSH works but touch /tmp/test fails with "No space left on device." df -h shows / at 60% usage — plenty of space. Application logs show "failed to open session file."

Choose your action:

  • A) Check disk quota settings for the application user
  • B) Check inode usage with df -i
  • C) Expand the filesystem to add more space
  • D) Restart the web server to clear temp files

[Result: df -i / shows 100% inode usage — 0 free inodes. The filesystem has space but no more files can be created because every inode is allocated. Proceed to Round 2.]

If you chose A:

[Result: No disk quotas are set. The issue is not quota-related.]

If you chose C:

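[Result: Growing an ext4 filesystem adds inodes only in proportion to the added space, because the bytes-per-inode ratio is fixed at mkfs time. It also does nothing about whatever is creating the files. Wasted effort.]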

If you chose D:

[Result: Web server restart fails — it cannot create its PID file. You have made things worse.]
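The check in option B can be sketched as a small helper that pulls the inode usage percentage out of df output. This is a minimal sketch, not part of the original scenario: the function names parse_inode_pct and inode_pct are hypothetical, and it assumes a df that accepts -P (POSIX single-line format) together with -i, as GNU coreutils does.

```shell
# Hypothetical helper: read `df -Pi` output on stdin and print the IUse%
# value (field 5 of the first data row) without the trailing "%".
parse_inode_pct() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Hypothetical wrapper: inode usage percentage for one mount point.
# Assumes the device name contains no spaces, so IUse% stays in field 5.
inode_pct() {
    df -Pi "$1" | parse_inode_pct
}
```

Running inode_pct / next to df -h / shows the mismatch at the heart of this incident: block usage looks healthy while inode usage is at its limit.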

Round 2: First Triage Data

[Pressure cue: "Web application is down. Users cannot log in (sessions cannot be created). Revenue impact."]

What you see: Need to find where the inodes are consumed. Millions of tiny files exist somewhere on the filesystem.

Choose your action:

  • A) Run find / -xdev -type f | wc -l to count total files
  • B) Check common locations for small file accumulation: /tmp, /var/spool, session dirs
  • C) Run for d in /*; do echo "$d: $(find "$d" -xdev -type f | wc -l)"; done
  • D) Delete all files older than 7 days in /tmp

[Result: /var/lib/php/sessions/ contains 3.2 million tiny session files. The session garbage collection (gc) is disabled — session.gc_probability = 0 in php.ini. Sessions have been accumulating for months. Proceed to Round 3.]

If you chose A:

[Result: Gives you the total count (3.4 million) but not the location. Slow — takes 10+ minutes to traverse.]

If you chose C:

[Result: Good approach — shows /var/lib has 3.2M files. Narrows it down but takes 5 minutes.]

If you chose D:

[Result: /tmp only has 2000 files. The sessions are in a different directory. No improvement.]
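The per-directory count from option C can be wrapped in a reusable function that sorts the results, so the worst offender appears first. This is a sketch under assumptions: count_entries is a hypothetical name, it counts every entry rather than only regular files (directories and symlinks consume inodes too), and it skips hidden top-level directories because the glob does not match them.

```shell
# Hypothetical helper: count filesystem entries under each immediate
# subdirectory of $1 and list the largest first.
count_entries() {
    base=${1%/}                      # "/" becomes "", so the glob is "/*/"
    for d in "$base"/*/; do
        [ -d "$d" ] || continue      # skip the literal pattern if nothing matched
        # -xdev keeps find on one filesystem. Each printed entry normally
        # occupies one inode (hard links share one), so this approximates
        # inode consumption per directory.
        n=$(find "$d" -xdev | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "${d%/}"
    done | sort -rn
}
```

Calling count_entries / and then repeating it on the largest directory (for example count_entries /var, then count_entries /var/lib) narrows the search one level at a time without walking the whole tree more than necessary.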

Round 3: Root Cause Identification

[Pressure cue: "Found 3.2M session files. Clean them up."]

What you see: Root cause: PHP session garbage collection was disabled (session.gc_probability = 0) for performance reasons. Sessions are created but never cleaned up. Over 8 months, 3.2 million session files consumed all available inodes.

Choose your action:

  • A) Delete all session files: find /var/lib/php/sessions -type f -delete
  • B) Delete sessions older than 24 hours, then enable GC
  • C) Enable session GC and wait for it to clean up
  • D) Switch to Redis for session storage

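[Result: find /var/lib/php/sessions -type f -mmin +1440 -delete removes 3.1M files. (-mtime +1 would match only files at least 48 hours old, because find discards fractional days before comparing.) Takes 3 minutes. Inodes drop to 5% used. Then set session.gc_probability = 1 and session.gc_maxlifetime = 1440 (seconds) in php.ini. Restart PHP-FPM. Proceed to Round 4.]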

If you chose A:

[Result: Deleting ALL sessions logs out every active user. Including those currently in checkout. Business impact.]

If you chose C:

[Result: GC processes a few sessions per request. With 3.2M files, it would take weeks to clean up at normal traffic rates.]

If you chose D:

[Result: Redis is a good long-term solution but requires application config changes, testing, and deployment. Not an incident-time fix.]
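The age-based cleanup from option B can be sketched with a dry-run count followed by the actual delete. This is a sketch under assumptions: the function names are hypothetical, and -mmin and -delete are GNU/BSD find extensions rather than POSIX. It relies on PHP's file session handler naming its files with a sess_ prefix, and it uses -mmin +1440 so the cutoff is 24 hours exactly.

```shell
# Hypothetical helper: count session files in $1 last modified more than
# $2 minutes ago. Run this first as a dry run.
count_old_sessions() {
    find "$1" -xdev -type f -name 'sess_*' -mmin "+$2" | wc -l | tr -d ' '
}

# Hypothetical helper: delete the same set of files.
# Restricting to sess_* avoids removing unrelated files in the directory.
purge_old_sessions() {
    find "$1" -xdev -type f -name 'sess_*' -mmin "+$2" -delete
}
```

For the incident above the calls would be count_old_sessions /var/lib/php/sessions 1440 followed by purge_old_sessions /var/lib/php/sessions 1440. Sessions touched within the last 24 hours are left alone, so active users stay logged in.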

Round 4: Remediation

[Pressure cue: "Sessions cleaned. Application recovering."]

Actions:

  1. Verify inode usage is healthy: df -i /
  2. Verify application can create new sessions
  3. Verify PHP session GC is enabled: check phpinfo()
  4. Add inode usage monitoring with alert at 80%
  5. Plan migration from file-based sessions to Redis
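The 80% inode alert in action 4 could be prototyped as a filter over df output before it is wired into a real monitoring system. This is a minimal sketch: inodes_over is a hypothetical name, the threshold is passed as an argument, and mount points containing spaces are not handled.

```shell
# Hypothetical helper: read `df -Pi` output on stdin and print each mount
# point whose IUse% is at or above the threshold given in $1.
# Filesystems that report "-" instead of a percentage are skipped.
inodes_over() {
    awk -v limit="$1" 'NR > 1 && $5 ~ /^[0-9]+%$/ {
        pct = $5
        sub(/%$/, "", pct)
        if (pct + 0 >= limit + 0) print $6 " " pct "%"
    }'
}
```

A cron job or monitoring agent could run df -Pi | inodes_over 80 and raise an alert whenever the output is non-empty.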

Damage Report

  • Total downtime: 20 minutes (application 500 errors)
  • Blast radius: All users unable to log in or maintain sessions
  • Optimal resolution time: 8 minutes (check df -i -> find sessions -> bulk delete old -> enable GC)
  • If every wrong choice was made: 60+ minutes plus user session loss from deleting active sessions

Cross-References