Lab 14: Log Analysis

| Field | Value |
|---|---|
| Tier | 3 - Operations |
| Estimated Time | 60 minutes |
| Prerequisites | k3s cluster, basic CLI |
| Auto-Grade | Yes |

Scenario

At 3:47 AM, your monitoring system detected elevated error rates on the payment service. By the time the on-call engineer woke up, the error rate had returned to normal. No customers complained, but the engineering manager wants a root cause analysis. The only evidence is the logs.

You have been given a set of application logs from the incident window. The logs span four services: the API gateway, the payment service, the user service, and the database proxy. Somewhere in these logs is the chain of events that caused the errors. You need to correlate events across services using timestamps and request IDs, identify the root cause, and write a summary.

Objectives

1. Parse the log files and identify the time window of the incident
2. Find the first error log entry that started the cascade
3. Trace the request ID from the first error across all four services
4. Identify the root cause (which service and what failure)
5. Count the total number of affected requests during the incident window
6. Write an incident summary to /tmp/lab-logs/incident-summary.txt
7. Create a script that extracts all error-level logs sorted by timestamp

Setup

./setup.sh

The script creates log files in /tmp/lab-logs/ that simulate the incident.

Hints

Hint 1: Grep for errors

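Start with `grep -hi error /tmp/lab-logs/*.log | sort -t' ' -k1,2` to see all errors sorted by timestamp. The `-h` flag suppresses the file name prefix that grep adds when it searches several files; without it, the first sort field begins with the file name and the output groups by file instead of by time.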
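The one-liner above can be wrapped into a small reusable script, which is also what the last objective asks for. The sketch below is one possible shape, not the reference solution: the function name `extract_errors` is made up for illustration, and it assumes each log line begins with a date field and a time field separated by a space, as the `sort` keys in this hint imply.

```shell
#!/bin/sh
# Sketch: print every error line from all log files, ordered by timestamp.
# Assumption: the first two space-separated fields are the date and the time.

extract_errors() {
  # Hypothetical helper; defaults to the lab's log directory.
  local log_dir="${1:-/tmp/lab-logs}"
  # -h drops the "file:" prefix so the timestamp stays in field 1;
  # -i matches error, Error, and ERROR.
  grep -hi 'error' "$log_dir"/*.log | LC_ALL=C sort -t' ' -k1,2
}
```

Calling `extract_errors` with no argument reads from /tmp/lab-logs; passing a directory makes it easy to try on a copy of the logs. Note that `-i error` matches any line containing the word, so an informational message that mentions "error" would also appear.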

Hint 2: Request ID correlation

Each log line contains a request ID like `req-abc123`. Find the first error, extract its request ID, then grep for that ID across all log files.
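The correlation step described above might be scripted roughly as follows. This is a sketch under assumptions: `trace_first_error` is a hypothetical name, the timestamp is taken to be the first two space-separated fields, and request IDs are assumed to match `req-[a-z0-9]+` as in the example ID.

```shell
#!/bin/sh
# Sketch: find the earliest error line, extract its request ID, and print
# every line that mentions that ID, labeled with the file it came from.

trace_first_error() {
  # Hypothetical helper; defaults to the lab's log directory.
  local log_dir="${1:-/tmp/lab-logs}"
  local first_id
  first_id=$(grep -hi 'error' "$log_dir"/*.log \
    | LC_ALL=C sort -t' ' -k1,2 \
    | head -n 1 \
    | grep -oE 'req-[a-z0-9]+' \
    | head -n 1)
  [ -n "$first_id" ] || { echo "no request ID found" >&2; return 1; }
  echo "first failing request: $first_id"
  # -H keeps the file name so each hit shows which service logged it;
  # -F treats the ID as a literal string; -w avoids matching longer IDs
  # that merely start with the same characters.
  grep -HFw -- "$first_id" "$log_dir"/*.log
}
```

The matching lines come out grouped by file rather than in time order, which is usually enough to see which services handled the request and which one logged the failure first.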

Hint 3: Timeline reconstruction

Sort all logs by timestamp to build a timeline: `sort -t' ' -k1,2 /tmp/lab-logs/*.log | less`
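Merging the files with plain `sort` loses track of which service wrote each line unless the log format already records it. One possible workaround, sketched below, appends the source file name to every line before sorting. The function name `build_timeline` is hypothetical, and the sketch assumes the timestamp sits in the first two space-separated fields.

```shell
#!/bin/sh
# Sketch: merge all logs into one timeline, tagging each line with the
# name of the file it came from.

build_timeline() {
  # Hypothetical helper; defaults to the lab's log directory.
  local log_dir="${1:-/tmp/lab-logs}"
  # awk appends "[file.log]" as a trailing field, so the date and time
  # remain in fields 1 and 2 for sort.
  awk '{ n = split(FILENAME, parts, "/"); print $0 " [" parts[n] "]" }' \
    "$log_dir"/*.log \
    | LC_ALL=C sort -t' ' -k1,2
}
```

Piping the result through `less`, as in the hint, gives a scrollable view: `build_timeline | less`.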

Hint 4: Counting affected requests

Extract unique request IDs from error lines: `grep -i error /tmp/lab-logs/*.log | grep -oE 'req-[a-z0-9]+' | sort -u | wc -l`
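The pipeline above counts request IDs across the whole log set. To restrict the count to the incident window, a timestamp filter can be added in front of the ID extraction. The sketch below assumes timestamps are in a format that sorts correctly as plain text (for example `YYYY-MM-DD HH:MM:SS`), which the lab does not confirm; `count_affected` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: count unique request IDs that appear on error lines between
# two timestamps (inclusive).

count_affected() {
  # Hypothetical helper: log directory, window start, window end.
  local log_dir="$1" start="$2" end="$3"
  grep -hi 'error' "$log_dir"/*.log \
    | awk -v s="$start" -v e="$end" \
        '{ ts = $1 " " $2 } ts >= s && ts <= e' \
    | grep -oE 'req-[a-z0-9]+' \
    | sort -u \
    | wc -l \
    | tr -d ' '
}
```

The start and end values are whatever window you identified in the first objective, passed in the same format the logs use, for example `count_affected /tmp/lab-logs "<start>" "<end>"`.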

Hint 5: Incident summary format

Your summary should include: time window, root cause, affected services, number of affected requests, and recommended remediation.
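One way to start the summary is to generate a skeleton with the five items listed above and fill in the values from your analysis. The field labels and the function name `write_summary_template` below are illustrative assumptions; the grader may expect different wording, so treat this only as a starting layout.

```shell
#!/bin/sh
# Sketch: write a fill-in-the-blanks incident summary. The angle-bracket
# placeholders are meant to be replaced by hand with your findings.

write_summary_template() {
  # Hypothetical helper; defaults to the path named in the objectives.
  local out="${1:-/tmp/lab-logs/incident-summary.txt}"
  cat > "$out" <<'EOF'
Incident Summary
Time window: <start> to <end>
Root cause: <service and failure>
Affected services: <list>
Affected requests: <count>
Recommended remediation: <actions>
EOF
}
```

Run `write_summary_template` once, then edit the resulting file so that no placeholder remains before grading.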

Grading

./grade.sh

Solution

See the solution/ directory for the complete analysis.