Lab 11: Monitoring Stack

| Field | Value |
| --- | --- |
| Tier | 3 — Operations |
| Estimated Time | 60 minutes |
| Prerequisites | k3s cluster, Helm |
| Auto-Grade | Yes |

Scenario

Your company just experienced a 4-hour outage that nobody detected until customers complained on social media. The postmortem revealed a critical gap: there is no monitoring, no alerting, and no dashboards. The CTO has given you one week to deploy a monitoring stack. You are starting with the foundation: Prometheus for metrics collection and Grafana for visualization.

You need to deploy Prometheus to scrape metrics from all pods in the cluster, configure alert rules for common failure scenarios (high CPU, pod restarts, node pressure), deploy Grafana with a pre-configured dashboard, and verify the entire pipeline works end-to-end by triggering a test alert.

Objectives

  • Deploy Prometheus in namespace lab-monitoring using a Deployment
  • Configure Prometheus to scrape all pods with the annotation prometheus.io/scrape: "true"
  • Create an alerting rule: fire when any pod has restarted more than 5 times in 10 minutes
  • Deploy Grafana with a datasource pointing to Prometheus
  • Create a ConfigMap-based dashboard showing pod CPU and memory usage
  • Deploy a sample app with metrics endpoint and verify Prometheus scrapes it
  • Verify the alert rule appears in Prometheus UI (or API)

Setup

./setup.sh

The script creates the lab-monitoring namespace with a partial Prometheus configuration.

Hints

Hint 1: Prometheus ConfigMap
The Prometheus configuration goes in a ConfigMap mounted so that the file appears at `/etc/prometheus/prometheus.yml`. The key section is `scrape_configs`, which uses `kubernetes_sd_configs` for service discovery.
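As a minimal sketch of Hint 1, the ConfigMap and the matching volume mount might look like the following. The ConfigMap name `prometheus-config`, the volume name `config`, and the 15-second scrape interval are assumptions for illustration; `setup.sh` may already create a partial config under a different name, in which case adapt that one instead.

```yaml
# Sketch only: "prometheus-config" is a hypothetical name, not taken from the lab files.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: lab-monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s        # assumed value; tune as needed
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
---
# Pod spec of the Prometheus Deployment (metadata and selector omitted).
# Mounting the ConfigMap at /etc/prometheus makes the key above
# appear as /etc/prometheus/prometheus.yml inside the container.
spec:
  containers:
    - name: prometheus
      image: prom/prometheus
      args:
        - --config.file=/etc/prometheus/prometheus.yml
      volumeMounts:
        - name: config
          mountPath: /etc/prometheus
  volumes:
    - name: config
      configMap:
        name: prometheus-config
```

Mounting the whole directory rather than a single file with `subPath` lets ConfigMap updates propagate to the running pod, although Prometheus still needs a reload or restart to pick them up.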
Hint 2: Pod annotation-based discovery
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
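Pod discovery with `role: pod` only works if Prometheus is allowed to list and watch pods through the Kubernetes API. The manifests below are a sketch of the RBAC this typically requires; the ServiceAccount name `prometheus` and the ClusterRole name `prometheus-pod-discovery` are hypothetical, and `setup.sh` may already provide equivalents.

```yaml
# Sketch only: all resource names here are assumptions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: lab-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-pod-discovery
rules:
  - apiGroups: [""]               # core API group
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-pod-discovery
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-pod-discovery
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: lab-monitoring
```

For the binding to take effect, the Prometheus Deployment's pod spec would set `serviceAccountName: prometheus`. A ClusterRole is used here because the objective is to scrape pods in all namespaces.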
Hint 3: Alert rules
Create a separate ConfigMap for the rules and reference it in prometheus.yml under `rule_files`. Use the `kube_pod_container_status_restarts_total` metric, which is exposed by kube-state-metrics.
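One possible shape for the rules ConfigMap is sketched below. The ConfigMap name `prometheus-rules`, the file name `alerts.yml`, the mount path in the comment, and the alert name are all assumptions; the expression is one reasonable reading of "more than 5 restarts in 10 minutes", not necessarily the one the grader expects.

```yaml
# Sketch only: names and mount path are hypothetical.
# prometheus.yml would reference the mounted file, for example:
#   rule_files:
#     - /etc/prometheus/rules/*.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: lab-monitoring
data:
  alerts.yml: |
    groups:
      - name: pod-alerts
        rules:
          - alert: PodRestartingTooOften
            # The metric is per container, so sum per pod before comparing.
            expr: sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[10m])) > 5
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in 10 minutes"
```

Because `increase()` extrapolates over the window, the result can be fractional; comparing with `> 5` still matches the intent of the objective.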
Hint 4: Grafana datasource
Configure the datasource via environment variables or a provisioning ConfigMap mounted at `/etc/grafana/provisioning/datasources/`.
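A provisioning ConfigMap for the datasource could look roughly like this. It assumes Prometheus is reachable through a Service named `prometheus` on port 9090 in the `lab-monitoring` namespace; both the Service name and the ConfigMap name `grafana-datasources` are hypothetical.

```yaml
# Sketch only: the Service URL and ConfigMap name are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: lab-monitoring
data:
  datasources.yaml: |
    apiVersion: 1                 # provisioning file format version
    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy             # Grafana's backend makes the requests
        url: http://prometheus.lab-monitoring.svc:9090
        isDefault: true
```

Mounting this ConfigMap at `/etc/grafana/provisioning/datasources/` in the Grafana container lets Grafana load the datasource at startup without any manual UI steps.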
Hint 5: Testing the pipeline
Deploy a pod that exposes Prometheus-format metrics at `/metrics` and add the `prometheus.io/scrape: "true"` annotation. Then check the Prometheus targets page (`/targets`) to confirm the pod is listed and up.
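As a stand-in for a real application, the sketch below runs `prom/node-exporter`, which serves Prometheus-format metrics at `/metrics` on port 9100. The pod name `sample-metrics` is hypothetical, and the image is only one convenient choice; any container with a metrics endpoint would do.

```yaml
# Sketch only: pod name and image choice are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: sample-metrics
  namespace: lab-monitoring
  annotations:
    prometheus.io/scrape: "true"  # matched by the keep rule in Hint 2
spec:
  containers:
    - name: exporter
      image: prom/node-exporter
      ports:
        - name: metrics
          containerPort: 9100     # declared so pod discovery yields a target on this port
```

With the scrape config from Hint 2, pod discovery creates one target per declared container port, so declaring `containerPort: 9100` is what makes Prometheus scrape the right address. Once the pod is running, it should appear under the `kubernetes-pods` job on the targets page or in the response from `/api/v1/targets`.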

Grading

./grade.sh

Solution

See the solution/ directory for complete manifests.