
Runbook: Build Failure Triage

Domain: CI/CD
Alert: CI pipeline failing on main branch, or build success rate drops below threshold
Severity: P2
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 45 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: CI system access (GitHub Actions / GitLab CI / Jenkins), repo write access, ability to re-run pipelines

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
git log --oneline -10
If output shows: the most recent commit is from an automated bot or dependency updater → skip to Step 3 (a dependency change is the likely cause)
If output shows: no recent commits but the pipeline failed → this is likely an infrastructure issue; skip to Step 5
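The bot-author check above can be scripted. A minimal sketch; the author patterns (dependabot, renovate, "bot") are illustrative examples, so adjust them for your automation accounts:

```shell
# Print the author of the most recent commit and flag common automation accounts.
# The name patterns below are illustrative; adjust for your own bots.
author=$(git log -1 --format='%an' 2>/dev/null || echo unknown)
case "$author" in
  *dependabot*|*renovate*|*[Bb]ot*)
    echo "last commit by automation: $author -> see Step 3" ;;
  *)
    echo "last commit by: $author" ;;
esac
```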

Step 1: Find the First Failing Step

Why: CI pipelines often fail at a late step due to an error introduced in an earlier step — always diagnose from the first failure, not the last.

# In your CI system UI: open the failed pipeline run, scan top-to-bottom for the first red step
# Then pull up that step's full log and scroll to the first ERROR line
# In GitHub Actions: click the failing job → expand failing step → look for "Error" or "##[error]"
# In GitLab CI: click the failing job → find the first non-green line with "ERROR"
# In Jenkins: open the build → click "Console Output" → Ctrl+F for "ERROR" or "FAILED"
echo "Review CI logs in the UI — identify the FIRST red step, not the last"
Expected output:
A specific error message — examples:
  "cannot find module 'some-package'"
  "FAIL src/api/handler_test.go"
  "Error response from daemon: manifest unknown"
  "npm ERR! code ENOTFOUND"
If this fails: If logs are missing or truncated, check whether the runner crashed mid-job — go to Step 5.
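If you have the raw job log downloaded (most CI UIs offer a "download log" option), you can jump straight to the first error instead of scrolling. A sketch using a throwaway demo log; in practice, point grep at the real downloaded file:

```shell
# Demo log for illustration; in practice set log_file to the downloaded CI log.
log_file=$(mktemp)
printf '%s\n' 'step 1 ok' 'ERROR: cannot find module' 'step 3 FAILED' > "$log_file"

# -m1 stops at the FIRST match: diagnose from the first error, not the last.
grep -n -i -m1 -E 'error|fail' "$log_file"   # prints: 2:ERROR: cannot find module
```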

Step 2: Determine Flaky Test vs. Genuine Failure

Why: Re-running a flaky test wastes time; ignoring a genuine regression breaks production. You need to know which you're dealing with before fixing anything.

# Check if the same test passed in the previous pipeline run on this branch
# GitHub Actions: go to Actions tab → filter by branch → compare last two runs
# GitLab CI: Pipelines tab → find last two runs → compare the same job

# If you have access to the CI API, check recent run history for this specific job:
# GitHub CLI example:
gh run list --branch main --limit 10 --workflow <WORKFLOW_FILE_NAME>
Expected output:
A list of recent runs. If the failing test passed 2 runs ago and nothing changed,
it is likely flaky. If it has failed consistently since a specific commit, it is genuine.
If this fails: If gh CLI is not available, check the CI UI directly. If you cannot determine flakiness from run history, treat it as genuine and continue.
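If run history is inconclusive and you can run the suspect test locally, repeated runs give a quick flakiness signal: a genuine failure fails every run, a flaky one fails intermittently. A sketch; test_cmd is a placeholder for whatever invokes the single failing test in your repo:

```shell
# Replace test_cmd with your real invocation for the single failing test.
test_cmd="go test ./src/api -run TestHandler -count=1"   # illustrative example
runs=10 fails=0
for i in $(seq "$runs"); do
  # Count failures; a flaky test fails some but not all of these runs.
  $test_cmd >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "failed $fails of $runs runs"
```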

Step 3: Identify the Commit That Broke the Build

Why: Knowing the exact commit narrows the fix to a specific change rather than requiring you to audit the whole codebase.

# View recent commits to find the likely culprit
git log --oneline -20

# If the breaking commit is not obvious, use git bisect to find it automatically
# First, identify a known-good SHA from the CI history
git bisect start HEAD <LAST_GOOD_SHA>
git bisect run make test   # or whatever command runs the failing step locally
Expected output:
git bisect will print:
  "<SHA> is the first bad commit"
  commit <SHA>
  Author: ...
  Date: ...
      <commit message>
If this fails: If make test is not available locally, use git log --name-only HEAD~5..HEAD to see which files changed in the last 5 commits (including added and deleted files) and cross-reference with the failing step.

Step 4: Reproduce Locally

Why: Reproducing locally confirms the failure is real and gives you a fast feedback loop to test fixes without waiting for CI queues.

# Pull the failing branch and run the exact failing step locally
git fetch origin
git checkout <FAILING_BRANCH>

# Run tests (pick the one that matches your repo):
make test
# or
npm test
# or
go test ./...
# or
docker build -t <IMAGE_NAME>:<TAG> .
Expected output:
The same failure you saw in CI — if you cannot reproduce locally,
the issue is likely environment-specific (see Step 5).
If this fails: If you cannot reproduce, the failure is likely environment-specific (runner config, secrets, network). Skip directly to Step 5.
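When CI fails but local runs pass, tool-version drift is a frequent cause. A quick sketch that prints local toolchain versions so you can compare them against the versions logged in the CI job's setup step; the tool list is illustrative, so adjust it for your stack:

```shell
# Print local toolchain versions; compare against the CI job's setup logs.
# The tool list is an example; add whatever your build uses (python3, cargo, ...).
versions=$(
  for tool in git node npm docker make; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%-8s %s\n' "$tool" "$("$tool" --version 2>/dev/null | head -1)"
    else
      printf '%-8s not installed\n' "$tool"
    fi
  done
)
echo "$versions"
```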

Step 5: Check for Infrastructure Issues

Why: Build failures are not always code problems — a degraded runner, unreachable registry, or stale cache can cause identical-looking failures across unrelated commits.

# Check runner health (GitHub Actions self-hosted runners):
# GitHub UI: Settings → Actions → Runners → look for "Offline" or "Idle" runners

# GitLab runner status (run on the runner host):
gitlab-runner status

# GitHub self-hosted runner status (run on the runner host):
systemctl status actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service

# Check Docker registry reachability from your machine:
curl -I https://<REGISTRY_HOST>/v2/

# Check if dependency caches are stale (GitHub Actions):
# Go to Actions → Caches → look for caches older than expected TTL
echo "Check CI dashboard for runner status and cache health"
Expected output:
Runner status: active (running) — if not, the runner is the problem.
HTTP 200 or 401 from registry — if connection refused, registry is down.
If this fails: If the runner host is unreachable, escalate to the platform/infra team. If the registry is unreachable, check your cloud provider status page.
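One more infrastructure check worth doing on the runner host: disk pressure. A full disk or a bloated Docker build cache produces failures that look like code errors. A minimal sketch, run on the runner host:

```shell
# Root filesystem usage; builds fail in odd ways when the disk is full.
usage=$(df -hP / | awk 'NR==2 {print $5}')
echo "runner root disk usage: $usage"

# If the runner uses Docker, check image/build-cache disk usage too.
if command -v docker >/dev/null 2>&1; then
  docker system df
fi
```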

Step 6: Fix and Monitor

Why: After identifying the root cause, the fix must be confirmed by a passing pipeline — do not close the incident until CI is green.

# After making your fix, push and monitor the new pipeline run
git add <CHANGED_FILES>
git commit -m "fix: <description of what was broken and how you fixed it>"
git push origin <BRANCH_NAME>

# Monitor the pipeline until it passes
# GitHub CLI — watch a run:
gh run watch

# Or check run status:
gh run list --branch <BRANCH_NAME> --limit 5
Expected output:
All pipeline steps show green / "success".
gh run list shows "completed" with status "success".
If this fails: If the same step fails again with a different error, you may have fixed the first issue and revealed a second one — return to Step 1 and find the new first failure.

Verification

# Confirm the issue is resolved — pipeline is green on main
gh run list --branch main --limit 3
Success looks like: all recent runs show "completed" with status "success"; no new failures in the last 3 runs.
If still broken: Escalate — see below.

Escalation

  • Not resolved in 45 min → page Platform/Infra on-call: "Main branch CI has been broken for >45 min, root cause unknown, need help with runner/infra investigation"
  • Runner infrastructure down → page Platform/Infra on-call: "All CI runners are offline/unhealthy, all pipelines are blocked"
  • Security incident → page Security on-call: "Security incident: possible supply-chain compromise in CI pipeline"
  • Scope expanding (multiple repos affected) → page Platform/Infra on-call: "CI failure is affecting multiple repos — likely shared infrastructure issue, not code"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • If root cause was a flaky test: file a ticket to fix or quarantine the test
  • If root cause was an infrastructure issue: add alerting so it's caught proactively next time
  • Document the root cause and fix in the team incident log

Common Mistakes

  1. Fixing the last failing step instead of the first: The last step fails because an earlier step produced bad output. Always scroll to the first error in the logs.
  2. Retrying without understanding why it failed: Blind retries mask flakiness and make it harder to diagnose. Always read the logs before retrying.
  3. Blaming infrastructure before checking recent commits: Most build failures are code changes, not infrastructure. Check commits first unless the failure hit multiple unrelated pipelines simultaneously.
  4. Not reproducing locally before pushing a fix: Pushing speculative fixes creates noise in CI history and wastes runner time. Reproduce locally first.
  5. Closing the incident before CI is green: A fix is not a fix until the pipeline passes. Stay on it until you see green.

Cross-References

  • Topic Pack: training/library/topic-packs/cicd-fundamentals/ (deep background on CI pipeline architecture)
  • Related Runbook: deploy-rollback.md — if the build was green but the deploy is broken
  • Related Runbook: registry-pull-failure.md — if the failure is specifically an image pull error
  • Related Runbook: pipeline-stuck.md — if the pipeline is not failing but is hanging
