Lab 11: Monitoring Stack

Field	Value
Tier	3 — Operations
Estimated Time	60 minutes
Prerequisites	k3s cluster, Helm
Auto-Grade	Yes

Scenario

Your company just experienced a 4-hour outage that nobody detected until customers complained on social media. The postmortem revealed a critical gap: there is no monitoring, no alerting, and no dashboards. The CTO has given you one week to deploy a monitoring stack. You are starting with the foundation: Prometheus for metrics collection and Grafana for visualization.

You need to deploy Prometheus to scrape metrics from all pods in the cluster, configure alert rules for common failure scenarios (high CPU, pod restarts, node pressure), deploy Grafana with a pre-configured dashboard, and verify the entire pipeline works end-to-end by triggering a test alert.

Objectives

Deploy Prometheus in namespace lab-monitoring using a Deployment
Configure Prometheus to scrape all pods with the annotation prometheus.io/scrape: "true"
Create an alerting rule: fire when any pod has restarted more than 5 times in 10 minutes
Deploy Grafana with a datasource pointing to Prometheus
Create a ConfigMap-based dashboard showing pod CPU and memory usage
Deploy a sample app with a metrics endpoint and verify that Prometheus scrapes it
Verify that the alert rule appears in the Prometheus UI (or via the API)

Setup

./setup.sh

This creates the namespace lab-monitoring with a partial Prometheus config.

Hints

Hint 1: Prometheus ConfigMap

Prometheus configuration goes in a ConfigMap, mounted so that the file appears at `/etc/prometheus/prometheus.yml`. The key section is `scrape_configs`, which uses `kubernetes_sd_configs` for service discovery.
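As a rough sketch of how the ConfigMap and its mount could fit together, the manifest below pairs a minimal config with a Deployment. The names `prometheus-config` and `prometheus`, the 15s scrape interval, and the ServiceAccount are assumptions for illustration, not values taken from the lab's setup script or solution.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config          # hypothetical name
  namespace: lab-monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s         # assumed interval; tune as needed
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus                 # hypothetical name
  namespace: lab-monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus   # assumed to exist with permission to list pods
      containers:
        - name: prometheus
          image: prom/prometheus
          ports:
            - containerPort: 9090
          volumeMounts:
            # Mounting the ConfigMap as a directory places the key
            # prometheus.yml at /etc/prometheus/prometheus.yml, which is
            # the config path the prom/prometheus image reads by default.
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config
```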

Hint 2: Pod annotation-based discovery

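# Discover every pod, then keep only those annotated prometheus.io/scrape: "true"
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # Drop any target whose scrape annotation is not exactly "true"
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true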
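Pod discovery with `role: pod` only works if Prometheus is allowed to list and watch pods through the Kubernetes API. If the setup script has not already provided this, a minimal RBAC sketch might look like the following. The ServiceAccount name `prometheus` and the role name `prometheus-pod-discovery` are hypothetical.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus                  # hypothetical name
  namespace: lab-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-pod-discovery    # hypothetical name
rules:
  # Read-only access to pods in all namespaces, as needed by role: pod discovery
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-pod-discovery
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-pod-discovery
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: lab-monitoring
```

The Prometheus pod then needs `serviceAccountName: prometheus` in its pod spec to pick up these permissions.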

Hint 3: Alert rules

Create a separate ConfigMap for the rules and reference the mounted rule file in prometheus.yml under `rule_files`. Use the `kube_pod_container_status_restarts_total` metric, which is exported by kube-state-metrics.
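One possible shape for the rules ConfigMap is sketched below. The ConfigMap name, file name, group name, alert name, severity label, and mount path are all assumptions. The expression uses `increase()` over a 10-minute window as one reasonable reading of "more than 5 restarts in 10 minutes".

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules            # hypothetical name
  namespace: lab-monitoring
data:
  alerts.yml: |
    groups:
      - name: pod-alerts            # hypothetical group name
        rules:
          - alert: PodRestartingFrequently   # hypothetical alert name
            # Fires when a container's restart counter grew by more than 5
            # over the last 10 minutes.
            expr: increase(kube_pod_container_status_restarts_total[10m]) > 5
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 10 minutes"
  # In prometheus.yml, assuming this ConfigMap is mounted at /etc/prometheus/rules:
  #   rule_files:
  #     - /etc/prometheus/rules/*.yml
```

Because no `for:` clause is set, the alert goes straight to firing as soon as the expression is true. Adding a `for:` duration would hold it in the pending state first.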

Hint 4: Grafana datasource

Configure the datasource through environment variables or a provisioning ConfigMap mounted at `/etc/grafana/provisioning/datasources/`.
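A provisioning ConfigMap for the datasource could look roughly like this. The ConfigMap name and the Prometheus URL are assumptions. The URL presumes a Service named `prometheus` in lab-monitoring listening on port 9090, so adjust it to match your own Service.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources         # hypothetical name
  namespace: lab-monitoring
data:
  prometheus.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy               # Grafana's backend proxies the queries
        # Assumed Service DNS name and port; change to match your deployment.
        url: http://prometheus.lab-monitoring.svc:9090
        isDefault: true
```

Mount this ConfigMap into the Grafana container at `/etc/grafana/provisioning/datasources/` so Grafana loads it at startup.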

Hint 5: Testing the pipeline

Deploy a pod that exposes `/metrics` in the Prometheus text format and add the `prometheus.io/scrape: "true"` annotation. Then check the Prometheus targets page to confirm that the pod is listed and its state is UP.
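As an illustration, the Deployment below uses `prom/node-exporter` as a stand-in sample app, since it serves `/metrics` on port 9100. The choice of image and the name `sample-metrics-app` are assumptions, and the lab's own solution may use a different workload.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-metrics-app          # hypothetical name
  namespace: lab-monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sample-metrics-app
  template:
    metadata:
      labels:
        app: sample-metrics-app
      annotations:
        # The annotation goes on the pod template, not on the Deployment itself.
        prometheus.io/scrape: "true"
    spec:
      containers:
        - name: exporter
          image: prom/node-exporter # stand-in app exposing /metrics
          ports:
            # Declaring the port matters: with only the keep rule from Hint 2,
            # Prometheus builds the scrape address from declared container ports.
            - containerPort: 9100
```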

Grading

./grade.sh

Solution

See the solution/ directory for complete manifests.