
On-Call Survival: CI/CD

Print this. Pin it. Read it at 3 AM.


Alert: Build Failure

Severity: P2 (blocks deploy) / P3 (non-blocking branch)

First command:

# GitHub Actions: gh run list --repo <org>/<repo> --limit 5
gh run list --limit 5
gh run view <run-id> --log-failed
What you're looking for: The first failing step and its error output.

Decision tree:

Is it a test failure?
├── Yes → Is the test flaky (passes on retry)?
│         gh run rerun <run-id> --failed  (rerun only failed jobs)
│         If it still fails, the failure is real. Do NOT ship. Alert the dev team.
└── No → Is it a lint / format error?
    ├── Yes → Run locally: ruff check . && ruff format --check .
    │         Fix the lint error. Push. Do not override CI to skip lint.
    └── No → Is it a dependency fetch failure (pip/npm install timeout)?
        ├── Yes → Retry the run: gh run rerun <run-id>
        │         Still failing? Registry or mirror down → see "Registry Issue" below.
        └── No → Is it a secrets / credentials error ("auth failed", "403 Forbidden")?
                  Expired token in CI secrets? Update the secret in repo settings.
                  Escalate: "Build failing with auth error in step <name>: <paste>"
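The triage above can be sketched as a small helper that buckets a pasted log excerpt. The grep patterns are illustrative assumptions, not real gh output guarantees; tune them to what your pipeline actually prints.

```shell
#!/usr/bin/env bash
# Sketch: rough triage of a failing CI step's error text, mirroring the
# decision tree above. The patterns are assumptions -- adjust them to
# your pipeline's real log wording.
classify_ci_failure() {
  local log="$1"
  if   grep -qiE '(assert|[0-9]+ failed|FAILED .*test)' <<<"$log"; then echo test
  elif grep -qiE '(ruff|eslint|lint error|format --check)' <<<"$log"; then echo lint
  elif grep -qiE '(pip install|npm ERR|ETIMEDOUT|timed out)' <<<"$log"; then echo deps
  elif grep -qiE '(401|403|auth.*failed|Bad credentials)' <<<"$log"; then echo auth
  else echo unknown
  fi
}

# Usage: paste the first failing step's output from `gh run view --log-failed`:
classify_ci_failure 'ERROR: 403 Forbidden: Bad credentials'   # prints: auth
```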

Escalation trigger: Build blocking a production hotfix; flaky test that cannot be resolved quickly; secrets rotation needed for production.

Safe actions: View build logs, rerun failed builds.

Dangerous actions: Skip failing checks (--no-verify, force merge), rotate production secrets.


Alert: Deploy Failure / Helm Upgrade Failed

Severity: P1 (production deploy failed mid-rollout) / P2 (staging)

First command:

helm history <release> -n <ns>
helm status <release> -n <ns>
What you're looking for: The failed revision, its status description, and the previous good revision.

Decision tree:

Did the deploy start (any pods updated)?
├── No (failed before pods changed) → Check the Helm error: helm status <release> -n <ns>
│   Schema validation error? Fix values and redeploy.
│   RBAC/permission error? Fix the service account or escalate to infra.
└── Yes (partial rollout: some pods on the new version, some on the old) →
    Is the new version serving errors?
    ├── Yes → ROLLBACK: helm rollback <release> <previous-revision> -n <ns>
    │         Verify: helm history <release> -n <ns>; kubectl get pods -n <ns>
    └── No → Wait for the rollout to complete (it may self-heal).
              kubectl rollout status deploy/<name> -n <ns>
              Stuck > 10 min? → Rollback.
              Escalate: "Deploy <release> partially rolled out, monitoring for <X> min"
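Picking `<previous-revision>` by eye at 3 AM is error-prone. A minimal sketch of pulling it out of `helm history` output, assuming the default table layout and that status words ("deployed"/"superseded") don't also appear in a release description:

```shell
#!/usr/bin/env bash
# Sketch: find the rollback target from `helm history <release> -n <ns>`
# piped on stdin. The newest row with status deployed/superseded is the
# last known-good revision; the failed revision won't match.
last_good_revision() {
  grep -E 'deployed|superseded' | tail -n 1 | awk '{print $1}'
}

# Usage:
# helm history <release> -n <ns> | last_good_revision
```

Always eyeball `helm history` yourself before rolling back; this just saves you from typo-ing the revision number.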

Escalation trigger: Production rollback fails; new version is running but throwing 5xx; cannot determine safe rollback point.

Safe actions: helm history, helm status, kubectl rollout status — read-only.

Dangerous actions: helm rollback (restores previous version — coordinate with team), helm upgrade --force (may cause downtime).


Alert: Container Registry Issue (ImagePullBackOff / 401 / 403)

Severity: P1 (blocks all new pod scheduling)

First command:

kubectl describe pod <pod-name> -n <ns> | grep -A 10 Events
What you're looking for: Failed to pull image ... 401 Unauthorized, 403 Forbidden, or not found.

Decision tree:

Is it 401/403 (auth failure)?
├── Yes → Is the imagePullSecret valid?
│         kubectl get secret <pull-secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
│         Expired token? Rotate registry credentials and update the secret.
└── No → Is it "image not found" / manifest unknown?
    ├── Yes → Wrong tag? Check the image tag in the deployment spec.
    │         kubectl get deploy <name> -n <ns> -o yaml | grep image:
    │         Pushed to the wrong registry? Check CI build logs.
    └── No → Is the registry itself unreachable (timeout / connection refused)?
              curl -I https://<registry-host>/v2/
              Registry down? → Use a cached image (pinned tag, already on nodes) or escalate.
              Escalate: "Registry <host> unreachable, new pods cannot pull images"
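The three branches can be sketched as a classifier over the Events text that `kubectl describe pod` prints. The patterns are assumptions about kubelet's error wording; verify against your cluster's actual messages.

```shell
#!/usr/bin/env bash
# Sketch: bucket an image-pull failure from pod Events text, mirroring
# the tree above. Patterns are illustrative, not a kubelet API contract.
classify_pull_error() {
  local events="$1"
  if   grep -qiE '(401|403|unauthorized|authentication required)' <<<"$events"; then echo auth
  elif grep -qiE '(not found|manifest unknown)' <<<"$events"; then echo missing-image
  elif grep -qiE '(timeout|connection refused|no such host)' <<<"$events"; then echo unreachable
  else echo unknown
  fi
}

# Usage: paste the Failed-to-pull line from `kubectl describe pod`:
classify_pull_error 'Failed to pull image "reg/app:v2": 401 Unauthorized'   # prints: auth
```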

Escalation trigger: Registry unavailable and no cached images; cannot rotate credentials (token issuer unreachable); all environments affected.

Safe actions: Describe pod to see pull errors, inspect pull secret (decode, check expiry) — read-only.

Dangerous actions: Rotate registry credentials (brief auth disruption), change imagePullPolicy to Never (only uses cached images).


Alert: Rollback Needed

Severity: P1 (production is broken)

First command:

helm history <release> -n <ns>
kubectl rollout history deploy/<name> -n <ns>
What you're looking for: The previous stable revision number.

Decision tree:

Is the current version throwing errors (5xx, crashes)?
├── Yes → Identify the last good version from history.
│         Helm: helm rollback <release> <revision> -n <ns>
│         kubectl: kubectl rollout undo deploy/<name> -n <ns>
│         Verify recovery: kubectl get pods -n <ns>; check the error rate in Grafana.
└── No → Is the deploy stuck (not completing)?
    ├── Yes → kubectl rollout undo deploy/<name> -n <ns>; check that the rollback completes.
    └── No → Not a rollback situation. Monitor and investigate.

After rollback:
1. Confirm pods are running: kubectl get pods -n <ns>
2. Check that the error rate dropped (Grafana or logs).
3. Create an incident ticket.
4. Notify the deploy team: "Rolled back <release> to revision <N> due to <reason>."
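Step 1 can be scripted as a quick count of pods that are not yet healthy from `kubectl get pods` output. A minimal sketch, assuming the default column layout (STATUS is the third column):

```shell
#!/usr/bin/env bash
# Sketch: count pods whose STATUS is neither Running nor Completed from
# `kubectl get pods -n <ns>` output on stdin. Non-zero means keep watching.
pods_not_running() {
  # NR>1 skips the header row; $3 is the STATUS column
  awk 'NR>1 && $3!="Running" && $3!="Completed" {n++} END {print n+0}'
}

# Usage:
# kubectl get pods -n <ns> | pods_not_running
```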

Escalation trigger: Rollback also fails; cannot determine a stable version; data corruption suspected.

Safe actions: View rollout history — read-only.

Dangerous actions: helm rollback, kubectl rollout undo (these change running workloads — coordinate with team).


Alert: CI/CD Pipeline Completely Down

Severity: P1 (cannot deploy or ship hotfixes)

First command:

gh run list --limit 10
# Check: are all recent runs failing in setup steps (before tests)?
What you're looking for: All runs failing at the same early step (checkout, auth, runner setup).

Decision tree:

Failing at the checkout / auth step?
├── Yes → GitHub Actions outage? Check https://www.githubstatus.com
│         If it's a GitHub outage: wait. Document the incident. You cannot fix it externally.
└── No → Are self-hosted runners down?
    ├── Yes → Check runner pods: kubectl get pods -n ci | grep runner
    │         Runners crashed? kubectl rollout restart deploy/<runner> -n ci
    └── No → CI configuration error (workflow edited recently)?
              git log --oneline -- .github/workflows/ | head -5
              Did the last change break it? Revert the workflow change.
              Escalate: "CI pipeline down, all runs failing at <step>"
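When every run dies at the same early step, the most recent workflow edit is the usual suspect. A small sketch that surfaces it (run from the repo root; confirm the hash with the team before reverting anything):

```shell
#!/usr/bin/env bash
# Sketch: show the most recent commit that touched the workflow files.
last_workflow_change() {
  git log --oneline -1 -- .github/workflows/
}

# Usage, from a checkout of the repo:
# last_workflow_change          # e.g. "a1b2c3d tighten cache key"
# git revert <that-sha>         # only after confirming it caused the breakage
```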

Escalation trigger: CI down during a production incident requiring hotfix; GitHub outage with no alternative deploy path; self-hosted runners unrecoverable.

Safe actions: Check run status, view logs, check GitHub status page.

Dangerous actions: Edit CI workflow files, restart self-hosted runners, trigger manual deploys.


Quick Reference

Most Useful Commands

# List recent CI runs
gh run list --limit 10

# View failed run logs
gh run view <run-id> --log-failed

# Rerun only failed jobs
gh run rerun <run-id> --failed

# Helm release history
helm history <release> -n <ns>

# Helm rollback
helm rollback <release> <revision> -n <ns>

# Kubernetes rollout history
kubectl rollout history deploy/<name> -n <ns>

# Kubernetes rollback
kubectl rollout undo deploy/<name> -n <ns>

# Check image pull errors
kubectl describe pod <pod> -n <ns> | grep -A 10 Events

# Decode pull secret
kubectl get secret <name> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# Check recent commits on main
git log --oneline -10

Escalation Contacts

Situation                          Team                      Channel
Build blocking production hotfix   Dev lead + DevOps         #incidents
Registry down                      Platform / Infra          #infra-oncall
CI pipeline down during incident   On-call lead              Direct page
Rollback failed                    Dev lead + on-call lead   PagerDuty: app-critical

Safe vs Dangerous Actions

Safe (do without asking)        Dangerous (get approval)
View build logs                 Merge with failing checks
Rerun failed builds             Roll back a production deploy
Check Helm history              Rotate registry credentials
Describe pods for pull errors   Edit CI workflow files
Check GitHub status page        Force-push or amend commits

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]