Scenario: Linux Server Running Slow¶
The Prompt¶
"Users are reporting that the application is slow. The team says it's 'the server.' You're the first responder. Walk me through how you'd diagnose a slow Linux server."
Initial Report¶
Slack message from dev team: "API response times went from 50ms to 2+ seconds about 30 minutes ago. No deployments since yesterday. The app is running on a single EC2 instance. Can someone look at the server?"
Constraints¶
- Time pressure: Customer-facing degradation. Need to identify root cause within 15 minutes.
- Limited context: You don't know the application well. You have SSH access and monitoring dashboards.
Observable Evidence¶
- Dashboard: CPU usage is 95%, load average is 12 on a 4-core server
- free -h: shows 1.2G available out of 16G, swap usage is 3G
- df -h: /var/log is at 98%
- iostat -xz 1: /dev/xvda shows %util at 99%, await at 180ms
Expected Investigation Path¶
# 1. Quick system overview
uptime # Load average vs CPU count
free -h # Memory + swap
df -h # Disk space
dmesg -T | tail -20 # Recent kernel messages
# 2. Identify the bottleneck
iostat -xz 1 3 # Disk I/O (check %util, await)
mpstat -P ALL 1 3 # Per-CPU usage
vmstat 1 5 # Memory + swap activity (si/so)
# 3. Find the culprit process
top -bn1 | head -15 # Sort by CPU
ps aux --sort=-%mem | head -10 # Sort by memory
iotop -b -n 3 # Disk I/O per process
pidstat -d 1 5 # Detailed I/O per PID
# 4. Check logs
journalctl -p err --since "1 hour ago"
tail -50 /var/log/syslog
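The quick checks above can be wrapped into a one-shot snapshot script — a sketch assuming standard coreutils/procps tools; dmesg may require root, so it is allowed to fail silently here:

```shell
#!/bin/sh
# One-pass triage snapshot: load, memory, disk space, recent kernel messages.
for cmd in "uptime" "free -h" "df -h"; do
  echo "== $cmd =="
  $cmd                       # unquoted on purpose so "free -h" splits into args
done
echo "== dmesg (last 20 lines) =="
dmesg -T 2>/dev/null | tail -20   # empty if we lack permission; that's fine
```

Capturing all of this in one pass (e.g. redirected to a file) gives you a timestamped baseline to compare against once you start changing things.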
Strong Answer¶
"I'd start with the USE method — for each resource (CPU, memory, disk, network), I check utilization, saturation, and errors.
First, uptime tells me the load average. If it's above my CPU count, something is saturated. Then free -h for memory — I'm looking at available memory and swap usage. Any swap activity means memory pressure. df -h for disk space and iostat for disk I/O.
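The load-versus-cores check can be sketched as a quick shell test (a sketch; it compares only the whole-number part of the 1-minute load):

```shell
# Compare the 1-minute load average against the CPU count (Linux).
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
echo "cores=$cores 1-min load=$load1"
# Strip the fractional part for a plain integer comparison.
if [ "${load1%.*}" -ge "$cores" ]; then
  echo "load exceeds core count: some resource is saturated"
else
  echo "load within core count"
fi
```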
In this case, I see high disk I/O — 99% utilization with 180ms await times. That's the bottleneck. Normal SSD await is under 5ms. I'd run iotop to find which process is hammering the disk.
The 98% full /var/log is suspicious — likely a runaway log file. I'd check with du -sh /var/log/* and see if a service is logging excessively. If it's filling the disk, that would cause the high I/O as well.
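The /var/log check can be done in one pipeline; the truncation target below is a hypothetical path — confirm which file is actually growing first:

```shell
# Rank the largest entries under /var/log (errors from unreadable dirs suppressed).
du -sh /var/log/* 2>/dev/null | sort -rh | head -5

# Truncate rather than delete: the writing process keeps its open file
# descriptor, so deleting the file would not free space until a restart.
# : > /var/log/myapp/huge.log   # hypothetical path
```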
Short term: truncate or rotate the log file, restart the offending service. Long term: set up proper log rotation with logrotate, add disk space monitoring alerts, and consider moving logs to a dedicated volume or shipping them externally."
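The long-term fix could look like a logrotate drop-in — a sketch; the service name and paths are hypothetical:

```
# /etc/logrotate.d/myapp   (hypothetical service)
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

`copytruncate` avoids having to signal the service to reopen its log file, at the cost of possibly losing a few lines during the copy.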
Red Flags (Weak Answers)¶
- Jumping to restarting the server without investigation
- Only checking one resource (e.g., CPU) and ignoring disk I/O
- Not knowing iostat, vmstat, or iotop
- Not correlating the disk-full condition with performance
- Suggesting adding more RAM when the problem is disk I/O
Follow-ups¶
- "What if the disk I/O is from a database query and not logs?"
- "How would you set up monitoring to catch this earlier?"
- "What's the difference between load average and CPU utilization?"
- "The swap is at 3G — should we add more RAM or more swap?"
Key Concepts Tested¶
- USE method: Utilization, Saturation, Errors — systematic approach
- Linux performance tools: iostat, vmstat, pidstat, iotop
- Disk I/O understanding: %util, await, throughput vs IOPS
- Root cause analysis: Connecting disk full → high I/O → slow app
- Incident response discipline: Triage → identify → fix → prevent
Wiki Navigation¶
Related Content¶
- /proc Filesystem (Topic Pack, L2) — Linux Fundamentals
- Advanced Bash for Ops (Topic Pack, L1) — Linux Fundamentals
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Linux Fundamentals
- Bash Exercises (Quest Ladder) (CLI) (Exercise Set, L0) — Linux Fundamentals
- Case Study: CI Pipeline Fails — Docker Layer Cache Corruption (Case Study, L2) — Linux Fundamentals
- Case Study: Container Vuln Scanner False Positive Blocks Deploy (Case Study, L2) — Linux Fundamentals
- Case Study: Disk Full Root Services Down (Case Study, L1) — Linux Fundamentals
- Case Study: Disk Full — Runaway Logs, Fix Is Loki Retention (Case Study, L2) — Linux Fundamentals
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Linux Fundamentals
- Case Study: Inode Exhaustion (Case Study, L1) — Linux Fundamentals