Lab 14: Log Analysis
| Field | Value |
|---|---|
| Tier | 3 — Operations |
| Estimated Time | 60 minutes |
| Prerequisites | k3s cluster, basic CLI |
| Auto-Grade | Yes |
Scenario
At 3:47 AM, your monitoring system detected elevated error rates on the payment service. By the time the on-call engineer woke up, the error rate had returned to normal. No customers complained, but the engineering manager wants a root cause analysis. The only evidence is the logs.
You have been given a set of application logs from the incident window. The logs span four services: the API gateway, the payment service, the user service, and the database proxy. Somewhere in these logs is the chain of events that caused the errors. You need to correlate events across services using timestamps and request IDs, identify the root cause, and write a summary.
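The correlation step described above can be sketched as a small shell function. This is a minimal sketch, not the lab's reference solution. It assumes every log line begins with a date field and a time field separated by a space, and that the logs are the `*.log` files under `/tmp/lab-logs/`. The function name `trace_request` is hypothetical.

```shell
#!/bin/sh
# Sketch: print every line that mentions one request ID, across all
# service logs, in timestamp order.
# Assumption: fields 1 and 2 of each line are the date and the time.
trace_request() {
    req_id=$1
    log_dir=${2:-/tmp/lab-logs}
    # index() does a literal substring match, so a short ID also matches
    # longer IDs that begin with it. The source file name is appended as
    # a trailing tag so the timestamp stays at the start of the line.
    awk -v id="$req_id" '
        index($0, id) {
            n = split(FILENAME, parts, "/")
            print $0 "\t[" parts[n] "]"
        }
    ' "$log_dir"/*.log | sort -t' ' -k1,2
}
```

Calling `trace_request req-abc123` would list the matching lines from all services oldest first, each tagged with the file it came from, which makes it easier to see where the request first failed.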
Objectives
- Parse the log files and identify the time window of the incident
- Find the first error log entry that started the cascade
- Trace the request ID from the first error across all four services
- Identify the root cause (which service and what failure)
- Count the total number of affected requests during the incident window
- Write an incident summary to `/tmp/lab-logs/incident-summary.txt`
- Create a script that extracts all error-level logs sorted by timestamp
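One possible shape for the error-extraction script, together with helpers for the time-window and counting objectives, is sketched below. The log format is an assumption: the timestamp is taken to occupy the first two space-separated fields, the level is taken to appear as the word `ERROR`, and request IDs are taken to look like `req-abc123`. The function names `extract_errors`, `in_window`, and `count_affected` are hypothetical.

```shell
#!/bin/sh
# Sketch: error-level lines from every log, oldest first.
# Assumption: the level is written as the standalone word ERROR.
extract_errors() {
    log_dir=${1:-/tmp/lab-logs}
    # -h suppresses the filename prefix so the sort keys on the timestamp.
    grep -hw 'ERROR' "$log_dir"/*.log | sort -t' ' -k1,2
}

# Keep only lines whose "DATE TIME" prefix falls inside [lo, hi].
# Timestamps in a fixed-width format compare correctly as plain strings.
in_window() {
    awk -v lo="$1" -v hi="$2" '
        { ts = $1 " " $2 }
        ts >= lo && ts <= hi
    '
}

# Count distinct request IDs that appear on error lines.
count_affected() {
    extract_errors "${1:-/tmp/lab-logs}" |
        grep -oE 'req-[a-z0-9]+' | sort -u | wc -l
}
```

For example, `extract_errors | in_window "$start" "$end"` would narrow the errors to a window once you have chosen `start` and `end` from the timeline. Matching the word `ERROR` is stricter than `grep -i error`, which also matches lines that merely mention "error" in a message, so check which behavior fits the actual log format.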
Setup
The lab setup creates log files in `/tmp/lab-logs/` that simulate the incident.
Hints
Hint 1: Grep for errors

Start with `grep -hi error /tmp/lab-logs/*.log | sort -t' ' -k1,2` to see all errors sorted by timestamp. The `-h` flag drops the filename prefix that grep adds when it searches several files, so the sort keys on the timestamp rather than on the file name.

Hint 2: Request ID correlation

Each log line contains a request ID like `req-abc123`. Find the first error, extract its request ID, then grep for that ID across all log files.

Hint 3: Timeline reconstruction

Sort all logs by timestamp to build a timeline: `sort -t' ' -k1,2 /tmp/lab-logs/*.log | less`

Hint 4: Counting affected requests

Extract the unique request IDs from error lines and count them: `grep -i error /tmp/lab-logs/*.log | grep -oE 'req-[a-z0-9]+' | sort -u | wc -l`

Hint 5: Incident summary format

Your summary should include the time window, the root cause, the affected services, the number of affected requests, and the recommended remediation.

Grading
Solution
See the solution/ directory for the complete analysis.