Portal | Level: L1: Foundations | Topics: Linux Fundamentals | Domain: Linux

Scenario: Linux Server Running Slow

The Prompt

"Users are reporting that the application is slow. The team says it's 'the server.' You're the first responder. Walk me through how you'd diagnose a slow Linux server."

Initial Report

Slack message from dev team: "API response times went from 50ms to 2+ seconds about 30 minutes ago. No deployments since yesterday. The app is running on a single EC2 instance. Can someone look at the server?"

Constraints

  • Time pressure: Customer-facing degradation. Need to identify root cause within 15 minutes.
  • Limited context: You don't know the application well. You have SSH access and monitoring dashboards.

Observable Evidence

  • Dashboard: CPU usage is 95%, load average is 12 on a 4-core server
  • free -h: Shows 1.2G available out of 16G, swap usage is 3G
  • df -h: /var/log is at 98%
  • iostat -xz 1: /dev/xvda shows %util at 99%, await at 180ms

Expected Investigation Path

# 1. Quick system overview
uptime                          # Load average vs CPU count
free -h                         # Memory + swap
df -h                           # Disk space
dmesg -T | tail -20             # Recent kernel messages

# 2. Identify the bottleneck
iostat -xz 1 3                  # Disk I/O (check %util, await)
mpstat -P ALL 1 3               # Per-CPU usage
vmstat 1 5                      # Memory + swap activity (si/so)

# 3. Find the culprit process
top -bn1 | head -15             # Sort by CPU
ps aux --sort=-%mem | head -10  # Sort by memory
iotop -b -n 3                   # Disk I/O per process
pidstat -d 1 5                  # Detailed I/O per PID

# 4. Check logs
journalctl -p err --since "1 hour ago"
tail -50 /var/log/syslog
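The steps above can be bundled into a small first-responder wrapper so the snapshot is captured in one pass. This is a sketch: the output path and section labels are my own, and `iostat` assumes the sysstat package is installed (errors from missing tools or insufficient privileges are captured in the output rather than aborting the run).

```shell
#!/bin/sh
# Hypothetical triage wrapper for the checks above (iostat comes from sysstat).
out="/tmp/triage-$(date +%Y%m%d-%H%M%S).txt"
{
  echo "== load ==";        uptime
  echo "== memory ==";      free -h
  echo "== disk space ==";  df -h
  echo "== kernel ==";      dmesg -T | tail -20
  echo "== disk I/O ==";    iostat -xz 1 3
  echo "== top CPU ==";     ps aux --sort=-%cpu | head -10
  echo "== top memory ==";  ps aux --sort=-%mem | head -10
} > "$out" 2>&1
echo "triage written to $out"
```

Writing to a file rather than the terminal means the evidence survives if the SSH session drops, and it can be pasted into the incident channel as-is.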

Strong Answer

"I'd start with Brendan Gregg's USE method — for each resource (CPU, memory, disk, network), I check utilization, saturation, and errors.

First, uptime tells me the load average. If it's well above my CPU count, something is saturated — and on Linux, tasks blocked in uninterruptible disk I/O count toward load, so high load doesn't automatically mean a CPU problem. Then free -h for memory — I'm looking at available memory and swap usage. Any swap activity (si/so in vmstat) means memory pressure. df -h for disk space and iostat for disk I/O.
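The load-versus-cores comparison can be sketched as a one-liner; the "one runnable task per core" threshold is a rule of thumb, not a hard limit:

```shell
# Compare the 1-minute load average to the core count (rule of thumb only).
cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)
awk -v l="$load1" -v c="$cores" \
  'BEGIN { printf "load %.2f on %d cores: %s\n", l, c, (l > c) ? "saturated" : "ok" }'
```

In this scenario (load 12 on 4 cores) it would print "saturated" — but because Linux load counts D-state tasks, the saturation could be disk, which is why iostat comes next.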

In this case, I see high disk I/O — 99% utilization with 180ms await times. That's the bottleneck: healthy SSD await is typically in the low single-digit milliseconds. I'd run iotop to find which process is hammering the disk.

The 98% full /var/log is suspicious — likely a runaway log file. I'd check with du -sh /var/log/* to see which service is logging excessively. The same runaway writes would explain both the filling disk and the heavy I/O.
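Localizing and freeing the space might look like this; the app log path is hypothetical. Truncating in place matters: the writing process keeps its open file descriptor, so deleting the file with rm would not free the space until the service restarts.

```shell
# Largest items under /var/log, biggest first
du -ah /var/log 2>/dev/null | sort -rh | head -10

# Truncate the offender in place -- this path is an assumption for illustration
: > /var/log/myapp/debug.log
```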

Short term: truncate or rotate the log file, restart the offending service. Long term: set up proper log rotation with logrotate, add disk space monitoring alerts, and consider moving logs to a dedicated volume or shipping them externally."
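For the long-term logrotate fix, a minimal config sketch (the path, rotation count, and copytruncate choice are assumptions about the hypothetical service):

```
# /etc/logrotate.d/myapp -- hypothetical service
/var/log/myapp/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
```

copytruncate avoids needing the service to reopen its log file; if the app supports a reload signal, a postrotate script sending that signal is the cleaner option.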

Red Flags (Weak Answers)

  • Jumping to restarting the server without investigation
  • Only checking one resource (e.g., CPU) and ignoring disk I/O
  • Not knowing iostat, vmstat, or iotop
  • Not correlating the disk full condition with performance
  • Suggesting adding more RAM when the problem is disk I/O

Follow-ups

  1. "What if the disk I/O is from a database query and not logs?"
  2. "How would you set up monitoring to catch this earlier?"
  3. "What's the difference between load average and CPU utilization?"
  4. "The swap is at 3G — should we add more RAM or more swap?"

Key Concepts Tested

  • USE method: Utilization, Saturation, Errors — systematic approach
  • Linux performance tools: iostat, vmstat, pidstat, iotop
  • Disk I/O understanding: %util, await, throughput vs IOPS
  • Root cause analysis: Connecting disk full → high I/O → slow app
  • Incident response discipline: Triage → identify → fix → prevent
