Portal | Level: L1: Foundations | Topics: Linux Fundamentals | Domain: Linux

Scenario: Linux Server Running Slow

The Prompt

"Users are reporting that the application is slow. The team says it's 'the server.' You're the first responder. Walk me through how you'd diagnose a slow Linux server."

Initial Report

Slack message from dev team: "API response times went from 50ms to 2+ seconds about 30 minutes ago. No deployments since yesterday. The app is running on a single EC2 instance. Can someone look at the server?"

Constraints

  • Time pressure: Customer-facing degradation. Need to identify root cause within 15 minutes.
  • Limited context: You don't know the application well. You have SSH access and monitoring dashboards.

Observable Evidence

  • Dashboard: CPU usage is 95%, load average is 12 on a 4-core server
  • free -h: Shows 1.2G available out of 16G, swap usage is 3G
  • df -h: /var/log is at 98%
  • iostat -xz 1: /dev/xvda shows %util at 99%, await at 180ms

Expected Investigation Path

# 1. Quick system overview
uptime                          # Load average vs CPU count
free -h                         # Memory + swap
df -h                           # Disk space
dmesg -T | tail -20             # Recent kernel messages

# 2. Identify the bottleneck
iostat -xz 1 3                  # Disk I/O (check %util, await)
mpstat -P ALL 1 3               # Per-CPU usage
vmstat 1 5                      # Memory + swap activity (si/so)

# 3. Find the culprit process
top -bn1 | head -15             # Sort by CPU
ps aux --sort=-%mem | head -10  # Sort by memory
iotop -b -n 3                   # Disk I/O per process
pidstat -d 1 5                  # Detailed I/O per PID

# 4. Check logs
journalctl -p err --since "1 hour ago"
tail -50 /var/log/syslog
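The steps above can be bundled into a small first-responder wrapper so the snapshot is captured in one pass. This is a sketch: the output path and section labels are my own, and `iostat` assumes the sysstat package is installed (errors from missing tools or insufficient privileges are captured in the output rather than aborting the run).

```shell
#!/bin/sh
# Hypothetical triage wrapper for the checks above (iostat comes from sysstat).
out="/tmp/triage-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== load ==";        uptime
  echo "== memory ==";      free -h
  echo "== disk space ==";  df -h
  echo "== kernel ==";      dmesg -T | tail -20
  echo "== disk I/O ==";    iostat -xz 1 3
  echo "== top CPU ==";     ps aux --sort=-%cpu | head -10
  echo "== top memory ==";  ps aux --sort=-%mem | head -10
} > "$out" 2>&1
echo "triage written to $out"
```

Writing to a file rather than the terminal means the evidence survives if the SSH session drops, and it can be pasted into the incident channel as-is.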

Strong Answer

"I'd start with Brendan Gregg's USE method — for each resource (CPU, memory, disk, network), I check utilization, saturation, and errors.

First, uptime tells me the load average. If it's well above my CPU count, something is saturated — and on Linux, tasks blocked in uninterruptible disk I/O count toward load, so high load doesn't automatically mean a CPU problem. Then free -h for memory — I'm looking at available memory and swap usage. Any swap activity (si/so in vmstat) means memory pressure. df -h for disk space and iostat for disk I/O.
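The load-versus-cores comparison can be sketched as a one-liner; the "one runnable task per core" threshold is a rule of thumb, not a hard limit:

```shell
# Compare the 1-minute load average to the core count (rule of thumb only).
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load1" -v c="$cores" \
  'BEGIN { printf "load %.2f on %d cores: %s\n", l, c, (l > c) ? "saturated" : "ok" }'
```

In this scenario (load 12 on 4 cores) it would print "saturated" — but because Linux load counts D-state tasks, the saturation could be disk, which is why iostat comes next.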

In this case, I see high disk I/O — 99% utilization with 180ms await times. That's the bottleneck: healthy SSD await is typically in the low single-digit milliseconds. I'd run iotop to find which process is hammering the disk.

The 98% full /var/log is suspicious — likely a runaway log file. I'd check with du -sh /var/log/* to see which service is logging excessively. The same runaway writes would explain both the filling disk and the heavy I/O.
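Localizing and freeing the space might look like this; the app log path is hypothetical. Truncating in place matters: the writing process keeps its open file descriptor, so deleting the file with rm would not free the space until the service restarts.

```shell
# Largest items under /var/log, biggest first
du -ah /var/log 2>/dev/null | sort -rh | head -10

# Truncate the offender in place -- this path is an assumption for illustration
: > /var/log/myapp/debug.log
```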

Short term: truncate or rotate the log file, restart the offending service. Long term: set up proper log rotation with logrotate, add disk space monitoring alerts, and consider moving logs to a dedicated volume or shipping them externally."
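For the long-term logrotate fix, a minimal config sketch (the path, rotation count, and copytruncate choice are assumptions about the hypothetical service):

```
# /etc/logrotate.d/myapp -- hypothetical service
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

copytruncate avoids needing the service to reopen its log file; if the app supports a reload signal, a postrotate script sending that signal is the cleaner option.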

Red Flags (Weak Answers)

  • Jumping to restarting the server without investigation
  • Only checking one resource (e.g., CPU) and ignoring disk I/O
  • Not knowing iostat, vmstat, or iotop
  • Not correlating the disk full condition with performance
  • Suggesting adding more RAM when the problem is disk I/O

Follow-ups

  1. "What if the disk I/O is from a database query and not logs?"
  2. "How would you set up monitoring to catch this earlier?"
  3. "What's the difference between load average and CPU utilization?"
  4. "The swap is at 3G — should we add more RAM or more swap?"

Key Concepts Tested

  • USE method: Utilization, Saturation, Errors — systematic approach
  • Linux performance tools: iostat, vmstat, pidstat, iotop
  • Disk I/O understanding: %util, await, throughput vs IOPS
  • Root cause analysis: Connecting disk full → high I/O → slow app
  • Incident response discipline: Triage → identify → fix → prevent
