
Runbook: Build Failure Triage

Domain: CI/CD
Alert: CI pipeline failing on main branch, or build success rate drops below threshold
Severity: P2
Est. Resolution Time: 15-30 minutes
Escalation Timeout: 45 minutes — page if not resolved
Last Tested: 2026-03-19
Prerequisites: CI system access (GitHub Actions / GitLab CI / Jenkins), repo write access, ability to re-run pipelines

Quick Assessment (30 seconds)

# Run this first — it tells you the scope of the problem
git log --oneline -10
If output shows: the most recent commit is from an automated bot or dependency updater → skip to Step 3 (a dependency change is the likely cause)
If output shows: no recent commits but the pipeline failed → this is likely an infrastructure issue; skip to Step 5
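The bot-author check above can be scripted. A minimal sketch; the author patterns (dependabot, renovate, "bot") are illustrative examples, so adjust them for your automation accounts:

```shell
# Print the author of the most recent commit and flag common automation accounts.
# The name patterns below are illustrative; adjust for your own bots.
author=$(git log -1 --format='%an' 2>/dev/null || echo unknown)
case "$author" in
  *dependabot*|*renovate*|*[Bb]ot*)
    echo "last commit by automation: $author -> see Step 3" ;;
  *)
    echo "last commit by: $author" ;;
esac
```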

Step 1: Find the First Failing Step

Why: CI pipelines often fail at a late step due to an error introduced in an earlier step — always diagnose from the first failure, not the last.

# In your CI system UI: open the failed pipeline run, scan top-to-bottom for the first red step
# Then pull up that step's full log and scroll to the first ERROR line
# In GitHub Actions: click the failing job → expand failing step → look for "Error" or "##[error]"
# In GitLab CI: click the failing job → find the first non-green line with "ERROR"
# In Jenkins: open the build → click "Console Output" → Ctrl+F for "ERROR" or "FAILED"
echo "Review CI logs in the UI — identify the FIRST red step, not the last"
Expected output:
A specific error message — examples:
  "cannot find module 'some-package'"
  "FAIL src/api/handler_test.go"
  "Error response from daemon: manifest unknown"
  "npm ERR! code ENOTFOUND"
If this fails: If logs are missing or truncated, check whether the runner crashed mid-job — go to Step 5.
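If you have the raw job log downloaded (most CI UIs offer a "download log" option), you can jump straight to the first error instead of scrolling. A sketch using a throwaway demo log; in practice, point grep at the real downloaded file:

```shell
# Demo log for illustration; in practice set log_file to the downloaded CI log.
log_file=$(mktemp)
printf '%s\n' 'step 1 ok' 'ERROR: cannot find module' 'step 3 FAILED' > "$log_file"

# -m1 stops at the FIRST match: diagnose from the first error, not the last.
grep -n -i -m1 -E 'error|fail' "$log_file"   # prints: 2:ERROR: cannot find module
```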

Step 2: Determine Flaky Test vs. Genuine Failure

Why: Re-running a flaky test wastes time; ignoring a genuine regression breaks production. You need to know which you're dealing with before fixing anything.

# Check if the same test passed in the previous pipeline run on this branch
# GitHub Actions: go to Actions tab → filter by branch → compare last two runs
# GitLab CI: Pipelines tab → find last two runs → compare the same job

# If you have access to the CI API, check recent run history for this specific job:
# GitHub CLI example:
gh run list --branch main --limit 10 --workflow <WORKFLOW_FILE_NAME>
Expected output:
A list of recent runs. If the failing test passed 2 runs ago and nothing changed,
it is likely flaky. If it has failed consistently since a specific commit, it is genuine.
If this fails: If gh CLI is not available, check the CI UI directly. If you cannot determine flakiness from run history, treat it as genuine and continue.
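If run history is inconclusive and you can run the suspect test locally, repeated runs give a quick flakiness signal: a genuine failure fails every run, a flaky one fails intermittently. A sketch; test_cmd is a placeholder for whatever invokes the single failing test in your repo:

```shell
# Replace test_cmd with your real invocation for the single failing test.
test_cmd="go test ./src/api -run TestHandler -count=1"   # illustrative example
runs=10 fails=0
for i in $(seq "$runs"); do
  # Count failures; a flaky test fails some but not all of these runs.
  $test_cmd >/dev/null 2>&1 || fails=$((fails + 1))
done
echo "failed $fails of $runs runs"
```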

Step 3: Identify the Commit That Broke the Build

Why: Knowing the exact commit narrows the fix to a specific change rather than requiring you to audit the whole codebase.

# View recent commits to find the likely culprit
git log --oneline -20

# If the breaking commit is not obvious, use git bisect to find it automatically
# First, identify a known-good SHA from the CI history
git bisect start HEAD <LAST_GOOD_SHA>
git bisect run make test   # or whatever command runs the failing step locally
Expected output:
git bisect will print:
  "<SHA> is the first bad commit"
  commit <SHA>
  Author: ...
  Date: ...
      <commit message>
If this fails: If make test is not available locally, use git log --name-only HEAD~5..HEAD to see which files changed in the last 5 commits (including added and deleted files) and cross-reference with the failing step.

Step 4: Reproduce Locally

Why: Reproducing locally confirms the failure is real and gives you a fast feedback loop to test fixes without waiting for CI queues.

# Pull the failing branch and run the exact failing step locally
git fetch origin
git checkout <FAILING_BRANCH>

# Run tests (pick the one that matches your repo):
make test
# or
npm test
# or
go test ./...
# or
docker build -t <IMAGE_NAME>:<TAG> .
Expected output:
The same failure you saw in CI — if you cannot reproduce locally,
the issue is likely environment-specific (see Step 5).
If this fails: If you cannot reproduce, the failure is likely environment-specific (runner config, secrets, network). Skip directly to Step 5.
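When CI fails but local runs pass, tool-version drift is a frequent cause. A quick sketch that prints local toolchain versions so you can compare them against the versions logged in the CI job's setup step; the tool list is illustrative, so adjust it for your stack:

```shell
# Print local toolchain versions; compare against the CI job's setup logs.
# The tool list is an example; add whatever your build uses (python3, cargo, ...).
versions=$(
  for tool in git node npm docker make; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%-8s %s\n' "$tool" "$("$tool" --version 2>/dev/null | head -1)"
    else
      printf '%-8s not installed\n' "$tool"
    fi
  done
)
echo "$versions"
```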

Step 5: Check for Infrastructure Issues

Why: Build failures are not always code problems — a degraded runner, unreachable registry, or stale cache can cause identical-looking failures across unrelated commits.

# Check runner health (GitHub Actions self-hosted runners):
# GitHub UI: Settings → Actions → Runners → look for "Offline" or "Idle" runners

# GitLab runner status (run on the runner host):
gitlab-runner status

# GitHub self-hosted runner status (run on the runner host):
systemctl status actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service

# Check Docker registry reachability from your machine:
curl -I https://<REGISTRY_HOST>/v2/

# Check if dependency caches are stale (GitHub Actions):
# Go to Actions → Caches → look for caches older than expected TTL
echo "Check CI dashboard for runner status and cache health"
Expected output:
Runner status: active (running) — if not, the runner is the problem.
HTTP 200 or 401 from registry — if connection refused, registry is down.
If this fails: If the runner host is unreachable, escalate to the platform/infra team. If the registry is unreachable, check your cloud provider status page.
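One more infrastructure check worth doing on the runner host: disk pressure. A full disk or a bloated Docker build cache produces failures that look like code errors. A minimal sketch, run on the runner host:

```shell
# Root filesystem usage; builds fail in odd ways when the disk is full.
usage=$(df -hP / | awk 'NR==2 {print $5}')
echo "runner root disk usage: $usage"

# If the runner uses Docker, check image/build-cache disk usage too.
if command -v docker >/dev/null 2>&1; then
  docker system df
fi
```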

Step 6: Fix and Monitor

Why: After identifying the root cause, the fix must be confirmed by a passing pipeline — do not close the incident until CI is green.

# After making your fix, push and monitor the new pipeline run
git add <CHANGED_FILES>
git commit -m "fix: <description of what was broken and how you fixed it>"
git push origin <BRANCH_NAME>

# Monitor the pipeline until it passes
# GitHub CLI — watch a run:
gh run watch

# Or check run status:
gh run list --branch <BRANCH_NAME> --limit 5
Expected output:
All pipeline steps show green / "success".
gh run list shows "completed" with status "success".
If this fails: If the same step fails again with a different error, you may have fixed the first issue and revealed a second one — return to Step 1 and find the new first failure.

Verification

# Confirm the issue is resolved — pipeline is green on main
gh run list --branch main --limit 3
Success looks like: all recent runs show "completed" with status "success"; no new failures in the last 3 runs.
If still broken: Escalate — see below.

Escalation

  • Not resolved in 45 min → page Platform/Infra on-call: "Main branch CI has been broken for >45 min, root cause unknown, need help with runner/infra investigation"
  • Runner infrastructure down → page Platform/Infra on-call: "All CI runners are offline/unhealthy, all pipelines are blocked"
  • Security incident → page Security on-call: "Security incident: possible supply-chain compromise in CI pipeline"
  • Scope expanding (multiple repos affected) → page Platform/Infra on-call: "CI failure is affecting multiple repos — likely shared infrastructure issue, not code"

Post-Incident

  • Update monitoring if alert was noisy or missing
  • File postmortem if P1/P2
  • Update this runbook if steps were wrong or incomplete
  • If root cause was a flaky test: file a ticket to fix or quarantine the test
  • If root cause was an infrastructure issue: add alerting so it's caught proactively next time
  • Document the root cause and fix in the team incident log

Common Mistakes

  1. Fixing the last failing step instead of the first: The last step fails because an earlier step produced bad output. Always scroll to the first error in the logs.
  2. Retrying without understanding why it failed: Blind retries mask flakiness and make it harder to diagnose. Always read the logs before retrying.
  3. Blaming infrastructure before checking recent commits: Most build failures are code changes, not infrastructure. Check commits first unless the failure hit multiple unrelated pipelines simultaneously.
  4. Not reproducing locally before pushing a fix: Pushing speculative fixes creates noise in CI history and wastes runner time. Reproduce locally first.
  5. Closing the incident before CI is green: A fix is not a fix until the pipeline passes. Stay on it until you see green.

Cross-References

  • Topic Pack: training/library/topic-packs/cicd-fundamentals/ (deep background on CI pipeline architecture)
  • Related Runbook: deploy-rollback.md — if the build was green but the deploy is broken
  • Related Runbook: registry-pull-failure.md — if the failure is specifically an image pull error
  • Related Runbook: pipeline-stuck.md — if the pipeline is not failing but is hanging
