
On-Call Survival: CI/CD

Print this. Pin it. Read it at 3 AM.


Alert: Build Failure

Severity: P2 (blocks deploy) / P3 (non-blocking branch)

First command:

# GitHub Actions: gh run list --repo <org>/<repo> --limit 5
gh run list --limit 5
gh run view <run-id> --log-failed
What you're looking for: The first failing step and its error output.

Decision tree:

Is it a test failure?
├── Yes → Is the test flaky (passes on retry)?
│         gh run rerun <run-id> --failed  (rerun only failed jobs)
│         If it still fails, the failure is real. Do NOT ship. Alert the dev team.
└── No → Is it a lint / format error?
    ├── Yes → Run locally: ruff check . && ruff format --check .
    │         Fix the lint error. Push. Do not override CI to skip lint.
    └── No → Is it a dependency fetch failure (pip/npm install timeout)?
        ├── Yes → Retry the run: gh run rerun <run-id>
        │         Still failing? Registry or mirror down → see "Registry Issue" below.
        └── No → Is it a secrets / credentials error ("auth failed", "403 Forbidden")?
                  Expired token in CI secrets? Update the secret in repo settings.
                  Escalate: "Build failing with auth error in step <name>: <paste>"
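The triage above can be sketched as a small helper that buckets a pasted log excerpt. The grep patterns are illustrative assumptions, not real gh output guarantees; tune them to what your pipeline actually prints.

```shell
#!/usr/bin/env bash
# Sketch: rough triage of a failing CI step's error text, mirroring the
# decision tree above. The patterns are assumptions -- adjust them to
# your pipeline's real log wording.
classify_ci_failure() {
  local log="$1"
  if   grep -qiE '(assert|[0-9]+ failed|FAILED .*test)' <<<"$log"; then echo test
  elif grep -qiE '(ruff|eslint|lint error|format --check)' <<<"$log"; then echo lint
  elif grep -qiE '(pip install|npm ERR|ETIMEDOUT|timed out)' <<<"$log"; then echo deps
  elif grep -qiE '(401|403|auth.*failed|Bad credentials)' <<<"$log"; then echo auth
  else echo unknown
  fi
}

# Usage: paste the first failing step's output from `gh run view --log-failed`:
classify_ci_failure 'ERROR: 403 Forbidden: Bad credentials'   # prints: auth
```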

Escalation trigger: Build blocking a production hotfix; flaky test that cannot be resolved quickly; secrets rotation needed for production.

Safe actions: View build logs, rerun failed builds.

Dangerous actions: Skip failing checks (--no-verify, force merge), rotate production secrets.


Alert: Deploy Failure / Helm Upgrade Failed

Severity: P1 (production deploy failed mid-rollout) / P2 (staging)

First command:

helm history <release> -n <ns>
helm status <release> -n <ns>
What you're looking for: The failed revision, its status description, and the previous good revision.

Decision tree:

Did the deploy start (any pods updated)?
├── No (failed before pods changed) → Check the Helm error: helm status <release> -n <ns>
│   Schema validation error? Fix values and redeploy.
│   RBAC/permission error? Fix the service account or escalate to infra.
└── Yes (partial rollout: some pods on the new version, some on the old) →
    Is the new version serving errors?
    ├── Yes → ROLLBACK: helm rollback <release> <previous-revision> -n <ns>
    │         Verify: helm history <release> -n <ns>; kubectl get pods -n <ns>
    └── No → Wait for the rollout to complete (it may self-heal).
              kubectl rollout status deploy/<name> -n <ns>
              Stuck > 10 min? → Rollback.
              Escalate: "Deploy <release> partially rolled out, monitoring for <X> min"
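Picking `<previous-revision>` by eye at 3 AM is error-prone. A minimal sketch of pulling it out of `helm history` output, assuming the default table layout and that status words ("deployed"/"superseded") don't also appear in a release description:

```shell
#!/usr/bin/env bash
# Sketch: find the rollback target from `helm history <release> -n <ns>`
# piped on stdin. The newest row with status deployed/superseded is the
# last known-good revision; the failed revision won't match.
last_good_revision() {
  grep -E 'deployed|superseded' | tail -n 1 | awk '{print $1}'
}

# Usage:
# helm history <release> -n <ns> | last_good_revision
```

Always eyeball `helm history` yourself before rolling back; this just saves you from typo-ing the revision number.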

Escalation trigger: Production rollback fails; new version is running but throwing 5xx; cannot determine safe rollback point.

Safe actions: helm history, helm status, kubectl rollout status — read-only.

Dangerous actions: helm rollback (restores previous version — coordinate with team), helm upgrade --force (may cause downtime).


Alert: Container Registry Issue (ImagePullBackOff / 401 / 403)

Severity: P1 (blocks all new pod scheduling)

First command:

kubectl describe pod <pod-name> -n <ns> | grep -A 10 Events
What you're looking for: Failed to pull image ... 401 Unauthorized, 403 Forbidden, or not found.

Decision tree:

Is it 401/403 (auth failure)?
├── Yes → Is the imagePullSecret valid?
│         kubectl get secret <pull-secret> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d
│         Expired token? Rotate registry credentials and update the secret.
└── No → Is it "image not found" / manifest unknown?
    ├── Yes → Wrong tag? Check the image tag in the deployment spec.
    │         kubectl get deploy <name> -n <ns> -o yaml | grep image:
    │         Pushed to the wrong registry? Check CI build logs.
    └── No → Is the registry itself unreachable (timeout / connection refused)?
              curl -I https://<registry-host>/v2/
              Registry down? → Use a cached image (pinned tag, already on nodes) or escalate.
              Escalate: "Registry <host> unreachable, new pods cannot pull images"
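The three branches can be sketched as a classifier over the Events text that `kubectl describe pod` prints. The patterns are assumptions about kubelet's error wording; verify against your cluster's actual messages.

```shell
#!/usr/bin/env bash
# Sketch: bucket an image-pull failure from pod Events text, mirroring
# the tree above. Patterns are illustrative, not a kubelet API contract.
classify_pull_error() {
  local events="$1"
  if   grep -qiE '(401|403|unauthorized|authentication required)' <<<"$events"; then echo auth
  elif grep -qiE '(not found|manifest unknown)' <<<"$events"; then echo missing-image
  elif grep -qiE '(timeout|connection refused|no such host)' <<<"$events"; then echo unreachable
  else echo unknown
  fi
}

# Usage: paste the Failed-to-pull line from `kubectl describe pod`:
classify_pull_error 'Failed to pull image "reg/app:v2": 401 Unauthorized'   # prints: auth
```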

Escalation trigger: Registry unavailable and no cached images; cannot rotate credentials (token issuer unreachable); all environments affected.

Safe actions: Describe pod to see pull errors, inspect pull secret (decode, check expiry) — read-only.

Dangerous actions: Rotate registry credentials (brief auth disruption), change imagePullPolicy to Never (only uses cached images).


Alert: Rollback Needed

Severity: P1 (production is broken)

First command:

helm history <release> -n <ns>
kubectl rollout history deploy/<name> -n <ns>
What you're looking for: The previous stable revision number.

Decision tree:

Is the current version throwing errors (5xx, crashes)?
├── Yes → Identify the last good version from history.
│         Helm: helm rollback <release> <revision> -n <ns>
│         kubectl: kubectl rollout undo deploy/<name> -n <ns>
│         Verify recovery: kubectl get pods -n <ns>; check the error rate in Grafana.
└── No → Is the deploy stuck (not completing)?
    ├── Yes → kubectl rollout undo deploy/<name> -n <ns>; check that the rollback completes.
    └── No → Not a rollback situation. Monitor and investigate.

After rollback:
1. Confirm pods are running: kubectl get pods -n <ns>
2. Check that the error rate dropped (Grafana or logs).
3. Create an incident ticket.
4. Notify the deploy team: "Rolled back <release> to revision <N> due to <reason>."
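Step 1 can be scripted as a quick count of pods that are not yet healthy from `kubectl get pods` output. A minimal sketch, assuming the default column layout (STATUS is the third column):

```shell
#!/usr/bin/env bash
# Sketch: count pods whose STATUS is neither Running nor Completed from
# `kubectl get pods -n <ns>` output on stdin. Non-zero means keep watching.
pods_not_running() {
  # NR>1 skips the header row; $3 is the STATUS column
  awk 'NR>1 && $3!="Running" && $3!="Completed" {n++} END {print n+0}'
}

# Usage:
# kubectl get pods -n <ns> | pods_not_running
```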

Escalation trigger: Rollback also fails; cannot determine a stable version; data corruption suspected.

Safe actions: View rollout history — read-only.

Dangerous actions: helm rollback, kubectl rollout undo (these change running workloads — coordinate with team).


Alert: CI/CD Pipeline Completely Down

Severity: P1 (cannot deploy or ship hotfixes)

First command:

gh run list --limit 10
# Check: are all recent runs failing in setup steps (before tests)?
What you're looking for: All runs failing at the same early step (checkout, auth, runner setup).

Decision tree:

Failing at the checkout / auth step?
├── Yes → GitHub Actions outage? Check https://www.githubstatus.com
│         If it's a GitHub outage: wait. Document the incident. You cannot fix it externally.
└── No → Are self-hosted runners down?
    ├── Yes → Check runner pods: kubectl get pods -n ci | grep runner
    │         Runners crashed? kubectl rollout restart deploy/<runner> -n ci
    └── No → CI configuration error (workflow edited recently)?
              git log --oneline -- .github/workflows/ | head -5
              Did the last change break it? Revert the workflow change.
              Escalate: "CI pipeline down, all runs failing at <step>"
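When every run dies at the same early step, the most recent workflow edit is the usual suspect. A small sketch that surfaces it (run from the repo root; confirm the hash with the team before reverting anything):

```shell
#!/usr/bin/env bash
# Sketch: show the most recent commit that touched the workflow files.
last_workflow_change() {
  git log --oneline -1 -- .github/workflows/
}

# Usage, from a checkout of the repo:
# last_workflow_change          # e.g. "a1b2c3d tighten cache key"
# git revert <that-sha>         # only after confirming it caused the breakage
```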

Escalation trigger: CI down during a production incident requiring hotfix; GitHub outage with no alternative deploy path; self-hosted runners unrecoverable.

Safe actions: Check run status, view logs, check GitHub status page.

Dangerous actions: Edit CI workflow files, restart self-hosted runners, trigger manual deploys.


Quick Reference

Most Useful Commands

# List recent CI runs
gh run list --limit 10

# View failed run logs
gh run view <run-id> --log-failed

# Rerun only failed jobs
gh run rerun <run-id> --failed

# Helm release history
helm history <release> -n <ns>

# Helm rollback
helm rollback <release> <revision> -n <ns>

# Kubernetes rollout history
kubectl rollout history deploy/<name> -n <ns>

# Kubernetes rollback
kubectl rollout undo deploy/<name> -n <ns>

# Check image pull errors
kubectl describe pod <pod> -n <ns> | grep -A 10 Events

# Decode pull secret
kubectl get secret <name> -n <ns> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# Check recent commits on main
git log --oneline -10

Escalation Contacts

Situation                          Team                      Channel
Build blocking production hotfix   Dev lead + DevOps         #incidents
Registry down                      Platform / Infra          #infra-oncall
CI pipeline down during incident   On-call lead              Direct page
Rollback failed                    Dev lead + on-call lead   PagerDuty: app-critical

Safe vs Dangerous Actions

Safe (do without asking)        Dangerous (get approval)
View build logs                 Merge with failing checks
Rerun failed builds             Roll back a production deploy
Check Helm history              Rotate registry credentials
Describe pods for pull errors   Edit CI workflow files
Check GitHub status page        Force-push or amend commits

Shift Handoff Template

Status: [GREEN/YELLOW/RED]
Active incidents: [none / description]
Recent deploys: [list from last 24h]
Known flaky alerts: [list]
Things to watch: [anything unusual]