---
tags:
  - devops
  - l1
  - runbook
  - cicd
  - cicd-pipelines-realities
---

Portal | Level: L1: Foundations | Topics: CI/CD, CI/CD Pipelines Realities | Domain: DevOps & Tooling
# Runbook: Build Failure Triage
| Field | Value |
|---|---|
| Domain | CI/CD |
| Alert | CI pipeline failing on main branch, or build success rate drops below threshold |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 45 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | CI system access (GitHub Actions / GitLab CI / Jenkins), repo write access, ability to re-run pipelines |
## Quick Assessment (30 seconds)

Check the most recent commit on the failing branch (e.g. `git log -1 --oneline`):

- If the most recent commit is from an automated bot or dependency updater → skip to Step 3 (a dependency change is the likely cause).
- If there are no recent commits but the pipeline still failed → this is likely an infrastructure issue; skip to Step 5.

## Step 1: Find the First Failing Step
Why: CI pipelines often fail at a late step due to an error introduced in an earlier step — always diagnose from the first failure, not the last.
```shell
# In your CI system UI: open the failed pipeline run, scan top-to-bottom for
# the FIRST red step, then pull up that step's full log and find the first ERROR line.
# - GitHub Actions: click the failing job → expand the failing step → look for "Error" or "##[error]"
# - GitLab CI: click the failing job → find the first non-green line with "ERROR"
# - Jenkins: open the build → click "Console Output" → Ctrl+F for "ERROR" or "FAILED"
echo "Review CI logs in the UI — identify the FIRST red step, not the last"
```
Expected output: a specific error message, for example:

```text
cannot find module 'some-package'
FAIL src/api/handler_test.go
Error response from daemon: manifest unknown
npm ERR! code ENOTFOUND
```
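To find the first error without scrolling by hand, the failed-step logs can be grepped. A minimal sketch, assuming the GitHub CLI is installed (`gh run view --log-failed` prints only the failed steps' logs; the run ID comes from `gh run list`) and that the error patterns below match your stack:

```shell
#!/usr/bin/env bash
# Grep the first error out of a failed job's log instead of eyeballing the UI.
# Real use (GitHub CLI, run ID from `gh run list`):
#   gh run view <RUN_ID> --log-failed | first_error
first_error() {
  # Print the first matching line with its line number (-m1 stops at one match).
  grep -i -n -m1 -E '##\[error\]|error:|FAIL|npm ERR!' -
}

# Example on a captured log snippet:
first_error <<'LOG'
Step 1/4: restoring cache
Step 2/4: npm install
npm ERR! code ENOTFOUND
npm ERR! network request failed
LOG
# prints: 3:npm ERR! code ENOTFOUND
```

The pattern list is deliberately loose; tighten it to whatever your build tool actually emits.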
## Step 2: Determine Flaky Test vs. Genuine Failure
Why: Re-running a flaky test wastes time; ignoring a genuine regression breaks production. You need to know which you're dealing with before fixing anything.
```shell
# Check whether the same test passed in the previous pipeline run on this branch.
# - GitHub Actions: Actions tab → filter by branch → compare the last two runs
# - GitLab CI: Pipelines tab → find the last two runs → compare the same job
# If you have access to the CI API, check recent run history for this specific job.
# GitHub CLI example:
gh run list --branch main --limit 10 --workflow <WORKFLOW_FILE_NAME>
```
Expected output: a list of recent runs. If the failing test passed two runs ago and nothing relevant changed, it is likely flaky. If it has failed consistently since a specific commit, it is genuine.

If the `gh` CLI is not available, check the CI UI directly. If you cannot determine flakiness from run history, treat the failure as genuine and continue.
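The history check can be scripted. A sketch: pull recent run conclusions with the GitHub CLI's JSON output (an assumption; adapt the first command for other CI systems), then classify with a pure-shell helper. The two-failure threshold is also an assumption; tune it for your pipeline's noise level:

```shell
#!/usr/bin/env bash
# Classify flaky vs. genuine from recent run history. Conclusions can come from:
#   gh run list --branch main --limit 10 --json conclusion --jq '.[].conclusion'
classify_history() {
  # Conclusions on stdin, newest run first. Count consecutive failures
  # before the most recent success: an unbroken streak suggests a genuine
  # regression; an isolated failure between successes looks flaky.
  local c streak=0
  while IFS= read -r c; do
    [ "$c" = "success" ] && break
    streak=$((streak + 1))
  done
  if [ "$streak" -ge 2 ]; then
    echo "genuine: $streak consecutive failures since last success"
  else
    echo "likely flaky: failure is not consistent across recent runs"
  fi
}

# Examples on literal histories (newest first):
printf 'failure\nfailure\nfailure\nsuccess\n' | classify_history
printf 'failure\nsuccess\nfailure\nsuccess\n' | classify_history
```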
## Step 3: Identify the Commit That Broke the Build
Why: Knowing the exact commit narrows the fix to a specific change rather than requiring you to audit the whole codebase.
```shell
# View recent commits to find the likely culprit
git log --oneline -20

# If the breaking commit is not obvious, use git bisect to find it automatically.
# First, identify a known-good SHA from the CI history:
git bisect start HEAD <LAST_GOOD_SHA>
git bisect run make test   # or whatever command runs the failing step locally
```
git bisect will print:

```text
<SHA> is the first bad commit
commit <SHA>
Author: ...
Date: ...
<commit message>
```

If `make test` is not available locally, use `git log --name-only HEAD~5..HEAD` to see which files changed in the last 5 commits and cross-reference with the failing step.
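`git bisect run` interprets the test command's exit code: 0 marks the commit good, 1–127 mark it bad, and 125 means "cannot test this commit, skip it". A wrapper script keeps untestable commits (e.g. ones that do not even build) from being blamed; the build and test commands here are parameters, with `true`/`false` as stand-ins to demonstrate the contract:

```shell
#!/usr/bin/env bash
# Exit-code wrapper for `git bisect run`: 0 = good, 1 = bad, 125 = skip.
# Real use: save as bisect-test.sh calling e.g.
#   bisect_step "make build" "make test"
# then run: git bisect run ./bisect-test.sh
bisect_step() {
  local build_cmd=$1 test_cmd=$2
  if ! $build_cmd >/dev/null 2>&1; then
    return 125   # untestable commit: skip rather than mark bad
  fi
  if $test_cmd >/dev/null 2>&1; then
    return 0     # tests pass: good
  fi
  return 1       # tests fail: bad
}

# Stand-in commands demonstrate the exit-code contract:
bisect_step true true   && echo "good ($?)"
bisect_step true false  || echo "bad ($?)"
bisect_step false true  || echo "skip ($?)"
```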
## Step 4: Reproduce Locally
Why: Reproducing locally confirms the failure is real and gives you a fast feedback loop to test fixes without waiting for CI queues.
```shell
# Pull the failing branch and run the exact failing step locally
git fetch origin
git checkout <FAILING_BRANCH>

# Run tests (pick the one that matches your repo):
make test
# or
npm test
# or
go test ./...
# or
docker build -t <IMAGE_NAME>:<TAG> .
```
Expected output: the same failure you saw in CI. If you cannot reproduce it locally, the issue is likely environment-specific (see Step 5).
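The "pick the one that matches your repo" choice can be automated from marker files. A small helper sketch; the file-to-command mapping is an assumption, so adjust it for your build system:

```shell
#!/usr/bin/env bash
# Pick the local test command from marker files in the repo root.
detect_test_cmd() {
  local dir=${1:-.}
  if   [ -f "$dir/Makefile" ];     then echo "make test"
  elif [ -f "$dir/package.json" ]; then echo "npm test"
  elif [ -f "$dir/go.mod" ];       then echo "go test ./..."
  elif [ -f "$dir/Dockerfile" ];   then echo "docker build ."
  else echo "unknown"; return 1
  fi
}

# Example: a repo with a package.json resolves to npm test
tmp=$(mktemp -d)
touch "$tmp/package.json"
detect_test_cmd "$tmp"   # prints: npm test
rm -rf "$tmp"
```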
## Step 5: Check for Infrastructure Issues
Why: Build failures are not always code problems — a degraded runner, unreachable registry, or stale cache can cause identical-looking failures across unrelated commits.
```shell
# Check runner health (GitHub Actions self-hosted runners):
# GitHub UI: Settings → Actions → Runners → look for "Offline" runners (idle is normal)

# GitLab runner status (run on the runner host):
gitlab-runner status

# GitHub self-hosted runner status (run on the runner host):
systemctl status actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service

# Check Docker registry reachability from your machine:
curl -I https://<REGISTRY_HOST>/v2/

# Check whether dependency caches are stale (GitHub Actions):
# Actions → Caches → look for caches older than their expected TTL
echo "Check CI dashboard for runner status and cache health"
```
Expected output: runner status `active (running)` — if not, the runner is the problem. HTTP 200 or 401 from the registry — if the connection is refused, the registry is down.
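Interpreting the registry probe can be scripted: a healthy Docker registry answers its `/v2/` base endpoint with HTTP 200 (anonymous access allowed) or 401 (authentication required), while curl reports `000` when the connection itself failed. A sketch, with the curl invocation shown as a comment since it needs a real registry host:

```shell
#!/usr/bin/env bash
# Capture the status code with:
#   code=$(curl -s -o /dev/null -w '%{http_code}' "https://<REGISTRY_HOST>/v2/")
registry_healthy() {
  case "$1" in
    200|401) echo "registry reachable (HTTP $1)"; return 0 ;;
    000)     echo "connection failed: registry or DNS is down"; return 1 ;;
    *)       echo "unexpected HTTP $1: registry may be degraded"; return 1 ;;
  esac
}

# Examples on literal status codes:
registry_healthy 401
registry_healthy 000 || true
```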
## Step 6: Fix and Monitor
Why: After identifying the root cause, the fix must be confirmed by a passing pipeline — do not close the incident until CI is green.
```shell
# After making your fix, push and monitor the new pipeline run
git add <CHANGED_FILES>
git commit -m "fix: <description of what was broken and how you fixed it>"
git push origin <BRANCH_NAME>

# Monitor the pipeline until it passes.
# GitHub CLI — watch a run:
gh run watch
# Or check run status:
gh run list --branch <BRANCH_NAME> --limit 5
```
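The monitoring can be wrapped in a poll-until-green loop so the incident is not closed on hope. With the GitHub CLI the check could be `gh run watch --exit-status` (blocks until the run completes, non-zero exit on failure); the poller below is generic and works with any command that exits 0 on success:

```shell
#!/usr/bin/env bash
# Retry a check command until it succeeds or attempts run out.
wait_for_green() {
  local attempts=$1 interval_s=$2; shift 2
  local i
  for i in $(seq 1 "$attempts"); do
    if "$@" >/dev/null 2>&1; then
      echo "green"
      return 0
    fi
    sleep "$interval_s"
  done
  echo "still red after $attempts attempts: escalate"
  return 1
}

# Example with a stand-in check (real use: wait_for_green 30 60 <check-cmd>):
wait_for_green 3 1 true   # prints: green
```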
## Verification

Success looks like: all recent runs show completed with status `success`, and no new failures in the last 3 runs.
If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 45 min | Platform/Infra on-call | "Main branch CI has been broken for >45 min, root cause unknown, need help with runner/infra investigation" |
| Runner infrastructure down | Platform/Infra on-call | "All CI runners are offline/unhealthy, all pipelines are blocked" |
| Security incident | Security on-call | "Security incident: possible supply-chain compromise in CI pipeline" |
| Scope expanding (multiple repos affected) | Platform/Infra on-call | "CI failure is affecting multiple repos — likely shared infrastructure issue, not code" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- If root cause was a flaky test: file a ticket to fix or quarantine the test
- If root cause was an infrastructure issue: add alerting so it's caught proactively next time
- Document the root cause and fix in the team incident log
## Common Mistakes
- Fixing the last failing step instead of the first: The last step fails because an earlier step produced bad output. Always scroll to the first error in the logs.
- Retrying without understanding why it failed: Blind retries mask flakiness and make it harder to diagnose. Always read the logs before retrying.
- Blaming infrastructure before checking recent commits: Most build failures are code changes, not infrastructure. Check commits first unless the failure hit multiple unrelated pipelines simultaneously.
- Not reproducing locally before pushing a fix: Pushing speculative fixes creates noise in CI history and wastes runner time. Reproduce locally first.
- Closing the incident before CI is green: A fix is not a fix until the pipeline passes. Stay on it until you see green.
## Cross-References

- Topic Pack: training/library/topic-packs/cicd-fundamentals/ (deep background on CI pipeline architecture)
- Related Runbook: deploy-rollback.md — if the build was green but the deploy is broken
- Related Runbook: registry-pull-failure.md — if the failure is specifically an image pull error
- Related Runbook: pipeline-stuck.md — if the pipeline is not failing but is hanging
## Related Content
- CI/CD Pipelines & Patterns (Topic Pack, L1) — CI/CD, CI/CD Pipelines Realities
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — CI/CD
- CI Pipeline Documentation (Reference, L1) — CI/CD
- CI/CD Drills (Drill, L1) — CI/CD
- CI/CD Flashcards (CLI) (flashcard_deck, L1) — CI/CD
- CircleCI Flashcards (CLI) (flashcard_deck, L1) — CI/CD
- Dagger / CI as Code (Topic Pack, L2) — CI/CD
- Deep Dive: CI/CD Pipeline Architecture (deep_dive, L2) — CI/CD
- GitHub Actions (Topic Pack, L1) — CI/CD
- Interview: CI Vuln Scan Failed (Scenario, L2) — CI/CD