Skip to content

Lab 14: Log Analysis

Field Value
Tier 3 — Operations
Estimated Time 60 minutes
Prerequisites k3s cluster, basic CLI
Auto-Grade Yes

Scenario

At 3:47 AM, your monitoring system detected elevated error rates on the payment service. By the time the on-call engineer woke up, the error rate had returned to normal. No customers complained, but the engineering manager wants a root cause analysis. The only evidence is the logs.

You have been given a set of application logs from the incident window. The logs span four services: the API gateway, the payment service, the user service, and the database proxy. Somewhere in these logs is the chain of events that caused the errors. You need to correlate events across services using timestamps and request IDs, identify the root cause, and write a summary.

Objectives

  • Parse the log files and identify the time window of the incident
  • Find the first error log entry that started the cascade
  • Trace the request ID from the first error across all four services
  • Identify the root cause (which service and what failure)
  • Count the total number of affected requests during the incident window
  • Write an incident summary to /tmp/lab-logs/incident-summary.txt
  • Create a script that extracts all error-level logs sorted by timestamp

Setup

./setup.sh

Creates log files at /tmp/lab-logs/ simulating the incident.

Hints

Hint 1: Grep for errors Start with `grep -i error /tmp/lab-logs/*.log | sort -t' ' -k1,2` to see all errors sorted by timestamp.
Hint 2: Request ID correlation Each log line contains a request ID like `req-abc123`. Find the first error, extract its request ID, then grep for that ID across all log files.
Hint 3: Timeline reconstruction Sort all logs by timestamp to build a timeline: `sort -t' ' -k1,2 /tmp/lab-logs/*.log | less`
Hint 4: Counting affected requests Extract unique request IDs from error lines: `grep -i error /tmp/lab-logs/*.log | grep -oE 'req-[a-z0-9]+' | sort -u | wc -l`
Hint 5: Incident summary format Your summary should include: time window, root cause, affected services, number of affected requests, and recommended remediation.

Grading

./grade.sh

Solution

See the solution/ directory for the complete analysis.