On-Call Survival: CI/CD
Print this. Pin it. Read it at 3 AM.
Alert: Build Failure
Severity: P2 (blocks deploy) / P3 (non-blocking branch)
First command:
# GitHub Actions: gh run list --repo <org>/<repo> --limit 5
gh run list --limit 5
gh run view <run-id> --log-failed
Decision tree:
Is it a test failure?
├── Yes → Is the test flaky (passes on retry)?
│ gh run rerun <run-id> --failed (rerun only failed jobs)
│ If still failing: the failure is real. Do NOT ship. Alert the dev team.
└── No → Is it a lint / format error?
├── Yes → Run locally: ruff check . && ruff format --check .
│ Fix the lint error. Push. Do not override CI to skip lint.
└── No → Is it a dependency fetch failure (pip/npm install timeout)?
├── Yes → Retry the run: gh run rerun <run-id>
│ Still failing? Registry or mirror down — see "Registry Issue" below.
└── No → Is it a secrets / credentials error ("auth failed", "403 Forbidden")?
Expired token in CI secrets? Update the secret in repo settings.
Escalate: "Build failing with auth error in step <name>: <paste>"
Escalation trigger: Build blocking a production hotfix; flaky test that cannot be resolved quickly; secrets rotation needed for production.
Safe actions: View build logs, rerun failed builds.
Dangerous actions: Skip failing checks (--no-verify, force merge), rotate production secrets.
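The flaky-test branch above can be wrapped in a small helper — a sketch, assuming the GitHub CLI (`gh`) is installed and authenticated, and the run ID comes from `gh run list`:

```shell
# Sketch of the flaky-test check: rerun only the failed jobs, then wait
# for the rerun to finish. `gh run watch --exit-status` exits non-zero
# if the run fails, so the output tells flaky apart from real.
retry_failed_jobs() {
  run_id="$1"
  gh run rerun "$run_id" --failed
  if gh run watch "$run_id" --exit-status; then
    echo "flaky: passed on retry"
  else
    echo "real failure: do not ship"
  fi
}
# Usage: retry_failed_jobs <run-id>
```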
Alert: Deploy Failure / Helm Upgrade Failed
Severity: P1 (production deploy failed mid-rollout) / P2 (staging)
First command:
helm history <release> -n <ns>
What you're looking for: The failed revision, its status description, and the previous good revision.
Decision tree:
Did the deploy start (any pods updated)?
├── No (failed before pods changed) → Check Helm error: helm status <release> -n <ns>
│ Schema validation error? Fix values and redeploy.
│ RBAC/permission error? Fix service account or escalate to infra.
└── Yes (partial rollout: some pods on new version, some on old)
→ Is the new version serving errors?
├── Yes → ROLLBACK: helm rollback <release> <previous-revision> -n <ns>
│ Verify: helm history <release> -n <ns>; kubectl get pods -n <ns>
└── No → Wait for rollout to complete (it may self-heal).
kubectl rollout status deploy/<name> -n <ns>
Stuck > 10 min? → Rollback.
Escalate: "Deploy <release> partially rolled out, monitoring for <X> min"
Escalation trigger: Production rollback fails; new version is running but throwing 5xx; cannot determine safe rollback point.
Safe actions: helm history, helm status, kubectl rollout status — read-only.
Dangerous actions: helm rollback (restores previous version — coordinate with team), helm upgrade --force (may cause downtime).
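The "wait, then roll back if stuck" step can be expressed as one guard — a sketch, assuming `kubectl` and `helm` contexts are already set; deploy name, namespace, release, and revision are placeholders:

```shell
# Sketch: give the rollout 10 minutes, then roll back if it has not
# completed. kubectl rollout status exits non-zero on timeout.
# Arguments: <deploy-name> <namespace> <release> <previous-revision>
wait_or_rollback() {
  name="$1"; ns="$2"; release="$3"; prev="$4"
  if kubectl rollout status "deploy/$name" -n "$ns" --timeout=10m; then
    echo "rollout complete"
  else
    echo "rollout stuck after 10m: rolling back"
    helm rollback "$release" "$prev" -n "$ns"
  fi
}
```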
Alert: Container Registry Issue (ImagePullBackOff / 401 / 403)
Severity: P1 (blocks all new pod scheduling)
First command:
kubectl describe pod <pod> -n <ns> | grep -A 10 Events
What you're looking for: "Failed to pull image ...", 401 Unauthorized, 403 Forbidden, or not found.
Decision tree:
Is it 401/403 (auth failure)?
├── Yes → Is the imagePullSecret valid?
│ kubectl get secret <pull-secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
│ Expired token? Rotate registry credentials and update the secret.
└── No → Is it "image not found" / manifest unknown?
├── Yes → Wrong tag? Check the image tag in the deployment spec.
│ kubectl get deploy <name> -n <ns> -o yaml | grep image:
│ Pushed to wrong registry? Check CI build logs.
└── No → Is the registry itself unreachable (timeout / connection refused)?
curl -I https://<registry-host>/v2/
Registry down? → Use cached image (pinned tag, already on nodes) or escalate.
Escalate: "Registry <host> unreachable, new pods cannot pull images"
Escalation trigger: Registry unavailable and no cached images; cannot rotate credentials (token issuer unreachable); all environments affected.
Safe actions: Describe pod to see pull errors, inspect pull secret (decode, check expiry) — read-only.
Dangerous actions: Rotate registry credentials (brief auth disruption), change imagePullPolicy to Never (only uses cached images).
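To see at a glance which registries the pull secret actually covers, the decoded JSON can be filtered — a minimal sketch, assuming python3 is on the node you run it from and the secret follows the standard `.dockerconfigjson` shape:

```shell
# Sketch: list the registry hosts a decoded .dockerconfigjson
# authenticates against. Reads the base64 payload on stdin.
list_registries() {
  base64 -d | python3 -c 'import json, sys
for host in json.load(sys.stdin)["auths"]:
    print(host)'
}
# kubectl get secret <pull-secret> -n <ns> \
#   -o jsonpath='{.data.\.dockerconfigjson}' | list_registries
```

If the registry your deployment pulls from is missing here, the secret is wrong, not expired.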
Alert: Rollback Needed
Severity: P1 (production is broken)
First command:
helm history <release> -n <ns>
kubectl rollout history deploy/<name> -n <ns>
What you're looking for: The previous stable revision number.
Decision tree:
Is the current version throwing errors (5xx, crashes)?
├── Yes → Identify last good version from history.
│ Helm: helm rollback <release> <revision> -n <ns>
│ kubectl: kubectl rollout undo deploy/<name> -n <ns>
│ Verify recovery: kubectl get pods -n <ns>; check error rate in Grafana.
└── No → Is the deploy stuck (not completing)?
├── Yes → kubectl rollout undo deploy/<name> -n <ns>; check if rollback completes.
└── No → Not a rollback situation. Monitor and investigate.
After rollback:
1. Confirm pods are running: kubectl get pods -n <ns>
2. Check error rate dropped (Grafana or logs).
3. Create an incident ticket.
4. Notify deploy team: "Rolled back <release> to revision <revision>; monitoring recovery."
Escalation trigger: Rollback also fails; cannot determine a stable version; data corruption suspected.
Safe actions: View rollout history — read-only.
Dangerous actions: helm rollback, kubectl rollout undo (these change running workloads — coordinate with team).
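Step 1 of the checklist can be partly automated — a sketch that polls pod status for up to two minutes; note that it treats any non-Running pod (including Completed jobs, a known false positive) as unhealthy:

```shell
# Sketch: after a rollback, poll until no pods in the namespace are in a
# non-Running state. Checks every 10s, 12 times (~2 minutes total).
verify_rollback() {
  ns="$1"
  for attempt in 1 2 3 4 5 6 7 8 9 10 11 12; do
    bad=$(kubectl get pods -n "$ns" --no-headers | grep -cv Running || true)
    if [ "$bad" -eq 0 ]; then
      echo "recovered"
      return 0
    fi
    sleep 10
  done
  echo "still unhealthy after 2m: escalate"
  return 1
}
```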
Alert: CI/CD Pipeline Completely Down
Severity: P1 (cannot deploy or ship hotfixes)
First command:
gh run list --limit 10
What you're looking for: All runs failing at the same early step (checkout, auth, runner setup).
Decision tree:
Failing at checkout / auth step?
├── Yes → GitHub Actions outage? Check https://www.githubstatus.com
│ If GitHub outage: wait. Document incident. Cannot fix externally.
└── No → Self-hosted runners down?
├── Yes → Check runner pods: kubectl get pods -n ci | grep runner
│ Runners crashed? kubectl rollout restart deploy/<runner> -n ci
└── No → CI configuration error (edited workflow recently)?
git log --oneline .github/workflows/ | head -5
Last change broke it? Revert the workflow change.
Escalate: "CI pipeline down, all runs failing at <step>"
Escalation trigger: CI down during a production incident requiring hotfix; GitHub outage with no alternative deploy path; self-hosted runners unrecoverable.
Safe actions: Check run status, view logs, check GitHub status page.
Dangerous actions: Edit CI workflow files, restart self-hosted runners, trigger manual deploys.
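The status-page check can be scripted: githubstatus.com is a hosted Statuspage, so its standard JSON endpoint reports an overall indicator. A sketch, assuming `curl` and `python3` are available:

```shell
# Sketch: read the overall indicator from GitHub's Statuspage API.
# "none" means all systems operational; "minor"/"major"/"critical"
# mean a degradation or outage is in progress.
check_github_status() {
  curl -fsS https://www.githubstatus.com/api/v2/status.json |
    python3 -c 'import json, sys; print(json.load(sys.stdin)["status"]["indicator"])'
}
```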
Quick Reference
Most Useful Commands
# List recent CI runs
gh run list --limit 10
# View failed run logs
gh run view <run-id> --log-failed
# Rerun only failed jobs
gh run rerun <run-id> --failed
# Helm release history
helm history <release> -n <ns>
# Helm rollback
helm rollback <release> <revision> -n <ns>
# Kubernetes rollout history
kubectl rollout history deploy/<name> -n <ns>
# Kubernetes rollback
kubectl rollout undo deploy/<name> -n <ns>
# Check image pull errors
kubectl describe pod <pod> -n <ns> | grep -A 10 Events
# Decode pull secret
kubectl get secret <name> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
# Check recent commits on main
git log --oneline -10
Escalation Contacts
| Situation | Team | Channel |
|---|---|---|
| Build blocking production hotfix | Dev lead + DevOps | #incidents |
| Registry down | Platform / Infra | #infra-oncall |
| CI pipeline down during incident | On-call lead | Direct page |
| Rollback failed | Dev lead + on-call lead | PagerDuty: app-critical |
Safe vs Dangerous Actions
| Safe (do without asking) | Dangerous (get approval) |
|---|---|
| View build logs | Merge with failing checks |
| Rerun failed builds | Rollback production deploy |
| Check Helm history | Rotate registry credentials |
| Describe pods for pull errors | Edit CI workflow files |
| Check GitHub status page | Force-push or amend commits |