---
tags:
  - devops
  - l1
  - runbook
  - cicd
---

Portal | Level: L1: Foundations | Topics: CI/CD | Domain: DevOps & Tooling

# Runbook: Pipeline Stuck / Hung Job
| Field | Value |
|---|---|
| Domain | CI/CD |
| Alert | Pipeline running for >2x expected duration, or pending job with available runner |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 45 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | CI system admin access, ability to cancel and re-run jobs, runner host access (SSH or cloud console) |
## Quick Assessment (30 seconds)

```shell
# Run this first — it tells you the scope of the problem
# GitHub Actions: check the Actions tab for running jobs
# GitLab CI: check CI/CD → Pipelines → Running
# Jenkins: check the build queue and executor status

# Check runner health (self-hosted runners):
# GitHub Actions:
gh run list --status in_progress --limit 10

# GitLab runner status (on the runner host):
gitlab-runner status
```
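The alert condition ("running for >2x expected duration") can be checked mechanically. A minimal sketch: `is_stuck` and the hard-coded minutes are illustrative, not part of any CI CLI; feed it the elapsed time from `gh run list` and your own expected duration.

```shell
# Flag a run whose elapsed time exceeds 2x its expected duration.
# Both arguments are integers in minutes; how you obtain them
# (gh CLI, GitLab API, UI) is up to you -- this only encodes the rule.
is_stuck() {
  elapsed_min=$1
  expected_min=$2
  [ "$elapsed_min" -gt $((expected_min * 2)) ]
}

# 47 minutes elapsed vs 20 expected -> over the 2x threshold
if is_stuck 47 20; then
  echo "STUCK: 47m elapsed vs 20m expected, start the runbook"
else
  echo "OK: within normal range"
fi
```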
## Step 1: Identify the Stuck Step and How Long It Has Been Running

Why: A pipeline that is "stuck" could be genuinely hung (infinite loop, deadlock, waiting for user input) or just slow (large test suite, slow network). Knowing the step and duration tells you which it is.

```shell
# GitHub Actions — view currently running jobs:
gh run list --status in_progress --limit 10

# View the specific run in detail:
gh run view <RUN_ID>

# Watch the run live (refreshes status until the run completes):
gh run watch <RUN_ID>

# GitLab CI — check via API if you have a token:
curl --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/projects/<PROJECT_ID>/pipelines?status=running"
```
Expected output: a list showing the pipeline ID, the job currently executing, and how long it has been running.

Example:

```
STATUS  NAME            WORKFLOW  BRANCH  EVENT  ID          ELAPSED
*       Build and Test  ci.yml    main    push   1234567890  47m
```

If the gh CLI is not available, check the CI system's web UI directly. Note the job name, step name, and start time.
## Step 2: Check If the Job Is Actually Executing or Stuck in Queue

Why: A job can be "running" in the CI system's view but stuck in the pending/queued state if no runner picked it up. These have different fixes.

```shell
# GitHub Actions — check if a runner is assigned:
# In the GitHub UI: go to the running workflow → check if the job shows "Queued" or "Running" status
# "Queued" with available runners = runner label mismatch or no capacity
# "Running" but no log output advancing = job is hung mid-execution

# Check runner registration and capacity:
# GitHub: Settings → Actions → Runners (check for offline or busy runners)

# GitLab — check pending jobs vs runner capacity:
curl --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/runners?scope=online"

# Jenkins — check the build executor status:
# In Jenkins UI: click "Build Executor Status" in the left sidebar
echo "Check CI UI for queued vs running state of the stuck job"
```
If stuck in queue: the job shows "Queued" or "Waiting for runner" with no log output.
If hung mid-execution: the job shows "Running" with log output that stopped advancing.
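The two outcomes above can be written down as a small decision table. A sketch: the status strings follow GitHub Actions job states, and the remediation hints are shorthand for the checks in this step, not output from any real tool.

```shell
# Classify a stuck job from its status plus whether logs are advancing.
# status: "queued" or "in_progress"; logs_advancing: "yes" or "no"
diagnose() {
  status=$1
  logs_advancing=$2
  case "$status:$logs_advancing" in
    queued:*)        echo "stuck-in-queue: check runner labels and capacity" ;;
    in_progress:no)  echo "hung-mid-execution: inspect the last log lines" ;;
    in_progress:yes) echo "still-working: likely just slow, keep watching" ;;
    *)               echo "unknown: check the CI UI directly" ;;
  esac
}

diagnose queued yes       # -> stuck-in-queue: check runner labels and capacity
diagnose in_progress no   # -> hung-mid-execution: inspect the last log lines
```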
## Step 3: Check Runner Logs for the Hung Job

Why: The runner's local logs often show what the job process is doing even when the CI system UI isn't updating — this tells you if the runner is frozen or the job process is frozen.

```shell
# Self-hosted GitHub Actions runner — check service status and recent logs:
systemctl status actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service
journalctl -u actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service -n 100 --no-pager

# GitLab runner — check status and logs:
gitlab-runner status
journalctl -u gitlab-runner -n 100 --no-pager

# Or check the log file directly:
tail -100 /var/log/gitlab-runner/gitlab-runner.log

# Jenkins — check agent logs in the Jenkins UI:
# Go to Jenkins → Manage Jenkins → Nodes → click the agent → log
```
Runner logs should show activity for the running job.
If runner is healthy: logs show the job commands executing.
If runner is frozen: logs stop at a specific command — that command is the hung step.
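"Logs stop at a specific command" can be spotted without reading the whole file: if the log's modification time has stopped advancing, nothing is writing to it. A sketch using GNU `stat`; the 300-second idle threshold is an assumption, and you would point it at your runner's actual log file.

```shell
# Report whether a log file has been written to within the last N seconds.
# Usage: log_stale <file> <max_idle_seconds>
log_stale() {
  file=$1
  max_idle=$2
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")   # GNU stat; on macOS use: stat -f %m "$file"
  idle=$((now - mtime))
  if [ "$idle" -gt "$max_idle" ]; then
    echo "STALE: no writes for ${idle}s (runner or job may be frozen)"
    return 1
  fi
  echo "ACTIVE: last write ${idle}s ago"
}

# Example against a file written just now, so it reports ACTIVE:
tmp=$(mktemp)
echo "job output" > "$tmp"
log_stale "$tmp" 300
rm -f "$tmp"
```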
## Step 4: Look for Hanging Tests or Processes in Job Output

Why: The most common cause of a hung pipeline is a test or process that started but never finished — network wait, deadlock, or missing `--timeout` flag. The last lines of the job log tell you what it's waiting for.

```shell
# Stream the last N lines of the job log to see what it's currently doing:
# GitHub Actions:
gh run view <RUN_ID> --log | tail -50

# GitLab CI — get the job trace via API:
curl --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/projects/<PROJECT_ID>/jobs/<JOB_ID>/trace" | tail -50

# Look for common hang patterns:
# - "Waiting for connection..." → network call that never returned
# - "Acquiring lock..." → deadlock on a shared resource
# - No output for >5 min → process is hung without output
# - "Run test..." without completion → test is in an infinite loop
echo "Check last 50 lines of job output for the stuck point"
```
The last lines will point to the exact command or test that is hung:

- `[09:45:23] Waiting for database to be ready...` → stuck waiting for a service
- `RUNNING: test_heavy_integration_suite` → test suite with no timeout
- `[10:02:11] Downloading artifact from s3://...` → network call timed out
## Step 5: Cancel the Stuck Job and Re-Run

Why: If you've identified a hanging step, canceling and re-running tells you if it's a transient issue (passes on retry) or a systematic hang (fails every time in the same place).

```shell
# Cancel the stuck run:
# GitHub Actions:
gh run cancel <RUN_ID>

# GitLab CI — cancel via API:
curl --request POST --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/projects/<PROJECT_ID>/pipelines/<PIPELINE_ID>/cancel"

# Re-run (after canceling):
# GitHub Actions:
gh run rerun <RUN_ID>

# Watch the new run to see if it gets stuck again in the same place:
gh run watch
```
After cancel: run status changes to "cancelled".
After re-run: a new run ID is created and begins executing.
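The transient-vs-systematic call comes down to whether the re-run stalls at the same point. A sketch that compares the last log line of two runs saved to local files; the file contents here are fabricated examples.

```shell
# If two runs stall on the same last log line, the hang is deterministic:
# retrying will not help, fix the code or configuration instead.
same_hang_point() {
  [ "$(tail -n 1 "$1")" = "$(tail -n 1 "$2")" ]
}

run1=$(mktemp); run2=$(mktemp)
echo "RUNNING: test_heavy_integration_suite" > "$run1"
echo "RUNNING: test_heavy_integration_suite" > "$run2"
if same_hang_point "$run1" "$run2"; then
  echo "systematic hang: same last line in both runs"
else
  echo "transient: runs stalled at different points"
fi
rm -f "$run1" "$run2"
```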
## Step 6: Address the Root Cause

Why: Canceling and retrying is a workaround, not a fix — without addressing the root cause, pipelines will continue to hang.

```shell
# If a hanging test is the cause — add a timeout to the test or the job.
# Example: add a timeout to a GitHub Actions job (in the workflow YAML):
#   jobs:
#     build:
#       timeout-minutes: 30   # <-- add this line

# If a hanging network call is the cause — add timeouts to the command:
# Example with curl: curl --max-time 30 <URL>
# Example with wget: wget --timeout=30 <URL>

# If the runner is the problem — restart the runner service:
# GitHub Actions runner:
systemctl restart actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service
# GitLab runner:
systemctl restart gitlab-runner

# If the runner host is unresponsive — restart the instance:
# AWS:
aws ec2 reboot-instances --instance-ids <INSTANCE_ID>
# Or terminate and let autoscaling replace it:
aws ec2 terminate-instances --instance-ids <INSTANCE_ID>
```
After adding job timeout: future runs will be cancelled automatically if they exceed the limit.
After restarting runner: runner service shows "active (running)".
After instance restart: runner re-registers with the CI system and appears online.
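For the per-command timeouts above, commands without a native timeout flag can be wrapped in coreutils `timeout`, which kills the child process and exits with status 124 when the limit expires:

```shell
# A fast command finishes normally under the cap:
timeout 5 sh -c 'sleep 1; echo "finished within the limit"'

# Simulate a hang: a 10-second sleep capped at 1 second.
rc=0
timeout 1 sleep 10 || rc=$?
[ "$rc" -eq 124 ] && echo "killed after timeout (exit $rc)"
```

This is the same blast-radius cap as `timeout-minutes`, applied at the level of a single command instead of a whole job.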
## Verification

```shell
# Confirm the issue is resolved — pipeline completes within expected time
gh run list --branch main --limit 5
```

Expected: recent runs completed with status success, with elapsed times in the expected range (not 2-3x the normal duration).

If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 45 min | Platform/Infra on-call | "P2: Pipeline hung for >45 min, runner restarts not helping, need platform team investigation" |
| All runners offline | Platform/Infra on-call | "All CI runners are offline — no jobs can execute, pipelines queuing up" |
| Security incident | Security on-call | "Security incident: pipeline hung at an unusual step — possible supply-chain attack or exfiltration attempt" |
| Scope expanding (all repos blocked) | Platform/Infra on-call | "Runner exhaustion affecting all repos — pipelines are queued and not executing" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add job-level `timeout-minutes` to all workflows that are missing it
- Add alerting for jobs running beyond 2x expected duration
- If a test was the cause: file a ticket to add a timeout or fix the test
- If a runner was the cause: review runner fleet health monitoring
## Common Mistakes
- Re-running without investigating: Blindly retrying a hung job can exhaust runner capacity and block other teams. Always check the logs before retrying.
- Not distinguishing runner problem from job problem: A runner issue affects all jobs; a job issue affects only that job. Checking runner health first prevents misdiagnosis.
- Not setting job timeouts to prevent future hangs: The most impactful fix is adding `timeout-minutes` to the CI job definition — this caps the blast radius of any future hang.
- Restarting the runner mid-job: Restarting a runner while it is actively running a job causes the job to fail immediately. Cancel the job first, then restart the runner.
- Not checking if the hung step is deterministic: If the same step hangs on every run, the fix must be in the code or configuration — retrying will never succeed.
## Cross-References

- Topic Pack: `training/library/topic-packs/cicd-fundamentals/` (deep background on CI runner architecture)
- Related Runbook: build-failure-triage.md — if the pipeline fails instead of hanging
- Related Runbook: registry-pull-failure.md — if the hang is at the image pull step