Lab 19: Multi-Cluster

| Field | Value |
|---|---|
| Tier | 4 — Advanced |
| Estimated Time | 90 minutes |
| Prerequisites | Two k3s clusters (or two namespaces simulating them) |
| Auto-Grade | Yes |

Scenario

Your company is expanding to a second region for disaster recovery. The business requires that if the primary cluster goes down, traffic automatically fails over to the secondary cluster within 60 seconds. You need to deploy the same application to both clusters (simulated as two namespaces), configure health checks against each of them, set up a failover mechanism, and verify it works by simulating a primary cluster failure.

In this lab, we simulate two clusters using two namespaces: lab-cluster-primary and lab-cluster-secondary. A "router" pod in namespace lab-cluster-router simulates DNS-based failover: it health-checks the clusters and routes traffic to primary while primary is healthy, falling back to secondary otherwise.
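The router's rule boils down to a three-way decision, including the case where both clusters are down. A minimal sketch as a POSIX shell function; `choose_backend` is a hypothetical helper (not taken from `solution/`) that receives the name of one health-check command per cluster, so the rule can be exercised without a cluster:

```sh
#!/bin/sh
# Hypothetical helper (not from solution/): decide where the router sends traffic.
# $1 / $2 name commands (e.g. shell functions) that exit 0 when the
# primary / secondary cluster is healthy.
choose_backend() {
  if "$1"; then
    echo "primary"      # primary always wins while it is healthy
  elif "$2"; then
    echo "secondary"    # fail over only when primary is down
  else
    echo "none"         # both clusters down: nothing healthy to route to
    return 1
  fi
}
```

For example, `choose_backend true false` prints `primary`, and `choose_backend false true` prints `secondary`. In a real router the two arguments would be functions wrapping a `wget` check against each cluster's Service.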

Objectives

- Deploy the application in lab-cluster-primary (3 replicas, service)
- Deploy the same application in lab-cluster-secondary (3 replicas, service)
- Deploy a health-check router in lab-cluster-router
- Router correctly routes to primary when both clusters are healthy
- Simulate primary failure (scale to 0) and verify failover to secondary
- Restore primary and verify traffic returns to primary
- Write a failover report to /tmp/lab-multicluster/failover-report.txt
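The first two objectives can be met from the command line. This is a sketch under stated assumptions: the app is a Deployment and Service both named `app` serving HTTP on port 80 (the names the hints use), and `nginx:1.25` is a stand-in image; the lab's real manifests live in `solution/`. The `KUBECTL` variable is a hypothetical hook that lets you dry-run the commands with `KUBECTL=echo`.

```sh
#!/bin/sh
# Sketch: deploy the same app into both simulated clusters.
# Assumptions: Deployment/Service name "app", port 80, stand-in image nginx:1.25.
KUBECTL=${KUBECTL:-kubectl}   # set KUBECTL=echo to print the commands instead
IMAGE=${IMAGE:-nginx:1.25}

deploy_cluster() {
  ns=$1
  # Three replicas of the app in this "cluster"
  $KUBECTL create deployment app --image="$IMAGE" --replicas=3 -n "$ns" || return 1
  # ClusterIP Service "app" on port 80, reachable from other namespaces as app.<ns>.svc
  $KUBECTL expose deployment app --port=80 --target-port=80 -n "$ns"
}

deploy_all() {
  for ns in lab-cluster-primary lab-cluster-secondary; do
    deploy_cluster "$ns" || return 1
  done
}
```

Run `deploy_all`, then confirm with `kubectl get deploy,svc -n lab-cluster-primary` and the same command for `lab-cluster-secondary`.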

Setup

```sh
./setup.sh
```

Creates the three namespaces and deploys a router skeleton.

Hints

Hint 1: Simulating clusters with namespaces

Each namespace acts as an independent "cluster." Cross-namespace Service access uses the DNS name `<service>.<namespace>.svc.cluster.local`, for example `app.lab-cluster-primary.svc.cluster.local`.
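The naming pattern is easy to wrap in a helper. A small sketch: `svc_url` is hypothetical, and the default port 80 and the default cluster domain `cluster.local` are assumptions matching the URLs used elsewhere in this lab.

```sh
#!/bin/sh
# Hypothetical helper: build the in-cluster URL of a Service in another namespace.
# Usage: svc_url SERVICE NAMESPACE [PORT]   (PORT defaults to 80)
svc_url() {
  printf 'http://%s.%s.svc.cluster.local:%s/\n' "$1" "$2" "${3:-80}"
}
```

`svc_url app lab-cluster-secondary` prints `http://app.lab-cluster-secondary.svc.cluster.local:80/`. The shorter `app.lab-cluster-primary.svc` form also resolves from inside a pod, because the pod's DNS search list appends `cluster.local`.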

Hint 2: Health check script

The router can run a loop that checks each cluster's health endpoint:

```sh
# Prefer primary: a successful response from its Service within 3 s counts as healthy
if wget -qO- --timeout=3 http://app.lab-cluster-primary.svc:80/ >/dev/null 2>&1; then
  echo "primary"
else
  echo "secondary"
fi
```
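Wrapped in a loop, that check becomes the router's main process. A sketch under stated assumptions: `router_loop`, `PRIMARY_URL`, and the 5-second `INTERVAL` are hypothetical choices, not the lab's router skeleton. The loop logs only when the active backend changes, so failover and failback stand out in the router's logs. With a 3-second timeout, detection should take roughly `INTERVAL` plus 3 seconds, well inside the 60-second target.

```sh
#!/bin/sh
# Sketch of a router loop (hypothetical; not the lab's router skeleton).
# PRIMARY_URL and INTERVAL are assumptions; override them via the environment.
PRIMARY_URL=${PRIMARY_URL:-http://app.lab-cluster-primary.svc:80/}
INTERVAL=${INTERVAL:-5}

# Same check as the hint: a successful response within 3 s means healthy.
healthy() {
  wget -qO- --timeout=3 "$1" >/dev/null 2>&1
}

# router_loop [MAX]: check every INTERVAL seconds; MAX=0 (default) runs forever.
# Prints a timestamped line only when the active backend changes.
router_loop() {
  max=${1:-0}
  i=0
  active=""
  while [ "$max" -eq 0 ] || [ "$i" -lt "$max" ]; do
    i=$((i + 1))
    if healthy "$PRIMARY_URL"; then new=primary; else new=secondary; fi
    if [ "$new" != "$active" ]; then
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) active=$new"
      active=$new
    fi
    sleep "$INTERVAL"
  done
}
```

Like the hint, this sketch falls back to secondary without checking it; a stricter router would also probe secondary before switching.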

Hint 3: Simulating failure

`kubectl scale deployment app --replicas=0 -n lab-cluster-primary` simulates a primary cluster outage. The router should detect this within its check interval.

Hint 4: Failover verification

After scaling primary to 0, the router should return responses from the secondary. Check logs or responses to verify the active backend changed.
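To turn "the active backend changed" into a measured number you can compare against the 60-second requirement, a polling helper can time how long the router takes to report the expected backend. A sketch: `wait_for_backend` is hypothetical, and it relies on `date +%s`, which GNU and BusyBox `date` support but POSIX does not require.

```sh
#!/bin/sh
# Hypothetical helper: time a failover or failback.
# wait_for_backend EXPECTED TIMEOUT CMD [ARGS...]
# Runs CMD about once per second until its output contains EXPECTED, then prints
# the elapsed seconds. Returns 1 if TIMEOUT seconds pass first.
wait_for_backend() {
  expected=$1
  timeout=$2
  shift 2
  start=$(date +%s)
  while :; do
    if "$@" 2>/dev/null | grep -q "$expected"; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
  done
}
```

For example, right after `kubectl scale deployment app --replicas=0 -n lab-cluster-primary`, run `wait_for_backend secondary 60 kubectl logs deploy/router -n lab-cluster-router --tail=1`. This assumes the router runs as a Deployment named `router` (hypothetical) and logs the active backend name, as the Hint 2 snippet does.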

Hint 5: Failback

Scale primary back to 3: `kubectl scale deployment app --replicas=3 -n lab-cluster-primary`. The router should prefer primary when it is healthy again.
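Once failover and failback have both been observed, the last objective is the report at `/tmp/lab-multicluster/failover-report.txt`. The layout below is an assumption; this page doesn't say what `grade.sh` checks, so compare against `solution/`. `write_report` is a hypothetical helper that takes the measured failover and failback times in seconds, and `REPORT` is a hypothetical override for the output path.

```sh
#!/bin/sh
# Hypothetical report writer; grade.sh may expect a different layout.
# write_report FAILOVER_SECONDS FAILBACK_SECONDS
write_report() {
  failover_s=$1
  failback_s=$2
  report=${REPORT:-/tmp/lab-multicluster/failover-report.txt}
  mkdir -p "$(dirname "$report")"
  {
    echo "Multi-cluster failover report"
    echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "Simulated failure: kubectl scale deployment app --replicas=0 -n lab-cluster-primary"
    echo "Failover to secondary: ${failover_s}s (target: 60s)"
    echo "Failback to primary: ${failback_s}s"
    # The scenario requires failover within 60 seconds
    if [ "$failover_s" -le 60 ]; then echo "Result: PASS"; else echo "Result: FAIL"; fi
  } > "$report"
}
```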

Grading

```sh
./grade.sh
```

Solution

See the `solution/` directory for the complete multi-cluster configuration.