---
tags:
  - devops
  - l1
  - runbook
  - cicd
---

Portal | Level: L1: Foundations | Topics: CI/CD | Domain: DevOps & Tooling

# Runbook: Pipeline Stuck / Hung Job
| Field | Value |
|---|---|
| Domain | CI/CD |
| Alert | Pipeline running for >2x expected duration, or pending job with available runner |
| Severity | P2 |
| Est. Resolution Time | 15-30 minutes |
| Escalation Timeout | 45 minutes — page if not resolved |
| Last Tested | 2026-03-19 |
| Prerequisites | CI system admin access, ability to cancel and re-run jobs, runner host access (SSH or cloud console) |
## Quick Assessment (30 seconds)

```shell
# Run this first — it tells you the scope of the problem
# GitHub Actions: check the Actions tab for running jobs
# GitLab CI: check CI/CD → Pipelines → Running
# Jenkins: check the build queue and executor status

# Check runner health (self-hosted runners):
# GitHub Actions:
gh run list --status in_progress --limit 10

# GitLab runner status (on the runner host):
gitlab-runner status
```
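The alert condition ("running for >2x expected duration") can be checked mechanically. A minimal sketch: `is_stuck` and the hard-coded minutes are illustrative, not part of any CI CLI; feed it the elapsed time from `gh run list` and your own expected duration.

```shell
# Flag a run whose elapsed time exceeds 2x its expected duration.
# Both arguments are integers in minutes; how you obtain them
# (gh CLI, GitLab API, UI) is up to you -- this only encodes the rule.
is_stuck() {
  elapsed_min=$1
  expected_min=$2
  [ "$elapsed_min" -gt $((expected_min * 2)) ]
}

# 47 minutes elapsed vs 20 expected -> over the 2x threshold
if is_stuck 47 20; then
  echo "STUCK: 47m elapsed vs 20m expected, start the runbook"
else
  echo "OK: within normal range"
fi
```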
## Step 1: Identify the Stuck Step and How Long It Has Been Running

Why: A pipeline that is "stuck" could be genuinely hung (infinite loop, deadlock, waiting for user input) or just slow (large test suite, slow network). Knowing the step and duration tells you which it is.

```shell
# GitHub Actions — view currently running jobs:
gh run list --status in_progress --limit 10

# View the specific run in detail:
gh run view <RUN_ID>

# Watch the run live (refreshes status until the run completes):
gh run watch <RUN_ID>

# GitLab CI — check via API if you have a token:
curl --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/projects/<PROJECT_ID>/pipelines?status=running"
```
Expected output: a list showing the pipeline ID, the job currently executing, and how long it has been running.

Example:

```
STATUS  NAME            WORKFLOW  BRANCH  EVENT  ID          ELAPSED
*       Build and Test  ci.yml    main    push   1234567890  47m
```

If the gh CLI is not available, check the CI system's web UI directly. Note the job name, step name, and start time.
## Step 2: Check If the Job Is Actually Executing or Stuck in Queue

Why: A job can be "running" in the CI system's view but stuck in the pending/queued state if no runner picked it up. These have different fixes.

```shell
# GitHub Actions — check if a runner is assigned:
# In the GitHub UI: go to the running workflow → check if the job shows "Queued" or "Running" status
# "Queued" with available runners = runner label mismatch or no capacity
# "Running" but no log output advancing = job is hung mid-execution

# Check runner registration and capacity:
# GitHub: Settings → Actions → Runners (check for offline or busy runners)

# GitLab — check pending jobs vs runner capacity:
curl --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/runners?scope=online"

# Jenkins — check the build executor status:
# In Jenkins UI: click "Build Executor Status" in the left sidebar
echo "Check CI UI for queued vs running state of the stuck job"
```
If stuck in queue: the job shows "Queued" or "Waiting for runner" with no log output.
If hung mid-execution: the job shows "Running" with log output that stopped advancing.
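The two outcomes above can be written down as a small decision table. A sketch: the status strings follow GitHub Actions job states, and the remediation hints are shorthand for the checks in this step, not output from any real tool.

```shell
# Classify a stuck job from its status plus whether logs are advancing.
# status: "queued" or "in_progress"; logs_advancing: "yes" or "no"
diagnose() {
  status=$1
  logs_advancing=$2
  case "$status:$logs_advancing" in
    queued:*)        echo "stuck-in-queue: check runner labels and capacity" ;;
    in_progress:no)  echo "hung-mid-execution: inspect the last log lines" ;;
    in_progress:yes) echo "still-working: likely just slow, keep watching" ;;
    *)               echo "unknown: check the CI UI directly" ;;
  esac
}

diagnose queued yes       # -> stuck-in-queue: check runner labels and capacity
diagnose in_progress no   # -> hung-mid-execution: inspect the last log lines
```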
## Step 3: Check Runner Logs for the Hung Job

Why: The runner's local logs often show what the job process is doing even when the CI system UI isn't updating — this tells you if the runner is frozen or the job process is frozen.

```shell
# Self-hosted GitHub Actions runner — check service status and recent logs:
systemctl status actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service
journalctl -u actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service -n 100 --no-pager

# GitLab runner — check status and logs:
gitlab-runner status
journalctl -u gitlab-runner -n 100 --no-pager

# Or check the log file directly:
tail -100 /var/log/gitlab-runner/gitlab-runner.log

# Jenkins — check agent logs in the Jenkins UI:
# Go to Jenkins → Manage Jenkins → Nodes → click the agent → log
```
Runner logs should show activity for the running job.
If runner is healthy: logs show the job commands executing.
If runner is frozen: logs stop at a specific command — that command is the hung step.
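"Logs stop at a specific command" can be spotted without reading the whole file: if the log's modification time has stopped advancing, nothing is writing to it. A sketch using GNU `stat`; the 300-second idle threshold is an assumption, and you would point it at your runner's actual log file.

```shell
# Report whether a log file has been written to within the last N seconds.
# Usage: log_stale <file> <max_idle_seconds>
log_stale() {
  file=$1
  max_idle=$2
  now=$(date +%s)
  mtime=$(stat -c %Y "$file")   # GNU stat; on macOS use: stat -f %m "$file"
  idle=$((now - mtime))
  if [ "$idle" -gt "$max_idle" ]; then
    echo "STALE: no writes for ${idle}s (runner or job may be frozen)"
    return 1
  fi
  echo "ACTIVE: last write ${idle}s ago"
}

# Example against a file written just now, so it reports ACTIVE:
tmp=$(mktemp)
echo "job output" > "$tmp"
log_stale "$tmp" 300
rm -f "$tmp"
```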
## Step 4: Look for Hanging Tests or Processes in Job Output

Why: The most common cause of a hung pipeline is a test or process that started but never finished — network wait, deadlock, or missing `--timeout` flag. The last lines of the job log tell you what it's waiting for.

```shell
# Stream the last N lines of the job log to see what it's currently doing:
# GitHub Actions:
gh run view <RUN_ID> --log | tail -50

# GitLab CI — get the job trace via API:
curl --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/projects/<PROJECT_ID>/jobs/<JOB_ID>/trace" | tail -50

# Look for common hang patterns:
# - "Waiting for connection..." → network call that never returned
# - "Acquiring lock..." → deadlock on a shared resource
# - No output for >5 min → process is hung without output
# - "Run test..." without completion → test is in an infinite loop
echo "Check last 50 lines of job output for the stuck point"
```
The last lines will point to the exact command or test that is hung:

- `[09:45:23] Waiting for database to be ready...` → stuck waiting for a service
- `RUNNING: test_heavy_integration_suite` → test suite with no timeout
- `[10:02:11] Downloading artifact from s3://...` → network call timed out
## Step 5: Cancel the Stuck Job and Re-Run

Why: If you've identified a hanging step, canceling and re-running tells you if it's a transient issue (passes on retry) or a systematic hang (fails every time in the same place).

```shell
# Cancel the stuck run:
# GitHub Actions:
gh run cancel <RUN_ID>

# GitLab CI — cancel via API:
curl --request POST --header "PRIVATE-TOKEN: <GITLAB_TOKEN>" \
  "https://<GITLAB_HOST>/api/v4/projects/<PROJECT_ID>/pipelines/<PIPELINE_ID>/cancel"

# Re-run (after canceling):
# GitHub Actions:
gh run rerun <RUN_ID>

# Watch the new run to see if it gets stuck again in the same place:
gh run watch
```
After cancel: run status changes to "cancelled".
After re-run: a new run ID is created and begins executing.
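The transient-vs-systematic call comes down to whether the re-run stalls at the same point. A sketch that compares the last log line of two runs saved to local files; the file contents here are fabricated examples.

```shell
# If two runs stall on the same last log line, the hang is deterministic:
# retrying will not help, fix the code or configuration instead.
same_hang_point() {
  [ "$(tail -n 1 "$1")" = "$(tail -n 1 "$2")" ]
}

run1=$(mktemp); run2=$(mktemp)
echo "RUNNING: test_heavy_integration_suite" > "$run1"
echo "RUNNING: test_heavy_integration_suite" > "$run2"
if same_hang_point "$run1" "$run2"; then
  echo "systematic hang: same last line in both runs"
else
  echo "transient: runs stalled at different points"
fi
rm -f "$run1" "$run2"
```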
## Step 6: Address the Root Cause

Why: Canceling and retrying is a workaround, not a fix — without addressing the root cause, pipelines will continue to hang.

```shell
# If a hanging test is the cause — add a timeout to the test or the job.
# Example: add a timeout to a GitHub Actions job (in the workflow YAML):
#   jobs:
#     build:
#       timeout-minutes: 30   # <-- add this line

# If a hanging network call is the cause — add timeouts to the command:
# Example with curl: curl --max-time 30 <URL>
# Example with wget: wget --timeout=30 <URL>

# If the runner is the problem — restart the runner service:
# GitHub Actions runner:
systemctl restart actions.runner.<ORG>-<REPO>.<RUNNER_NAME>.service
# GitLab runner:
systemctl restart gitlab-runner

# If the runner host is unresponsive — restart the instance:
# AWS:
aws ec2 reboot-instances --instance-ids <INSTANCE_ID>
# Or terminate and let autoscaling replace it:
aws ec2 terminate-instances --instance-ids <INSTANCE_ID>
```
After adding job timeout: future runs will be cancelled automatically if they exceed the limit.
After restarting runner: runner service shows "active (running)".
After instance restart: runner re-registers with the CI system and appears online.
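For the per-command timeouts above, commands without a native timeout flag can be wrapped in coreutils `timeout`, which kills the child process and exits with status 124 when the limit expires:

```shell
# A fast command finishes normally under the cap:
timeout 5 sh -c 'sleep 1; echo "finished within the limit"'

# Simulate a hang: a 10-second sleep capped at 1 second.
rc=0
timeout 1 sleep 10 || rc=$?
[ "$rc" -eq 124 ] && echo "killed after timeout (exit $rc)"
```

This is the same blast-radius cap as `timeout-minutes`, applied at the level of a single command instead of a whole job.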
## Verification

```shell
# Confirm the issue is resolved — pipeline completes within expected time
gh run list --branch main --limit 5
```

Expected: recent runs completed with status success, with elapsed times in the expected range (not 2-3x the normal duration).

If still broken: Escalate — see below.
## Escalation
| Condition | Who to Page | What to Say |
|---|---|---|
| Not resolved in 45 min | Platform/Infra on-call | "P2: Pipeline hung for >45 min, runner restarts not helping, need platform team investigation" |
| All runners offline | Platform/Infra on-call | "All CI runners are offline — no jobs can execute, pipelines queuing up" |
| Security incident | Security on-call | "Security incident: pipeline hung at an unusual step — possible supply-chain attack or exfiltration attempt" |
| Scope expanding (all repos blocked) | Platform/Infra on-call | "Runner exhaustion affecting all repos — pipelines are queued and not executing" |
## Post-Incident
- Update monitoring if alert was noisy or missing
- File postmortem if P1/P2
- Update this runbook if steps were wrong or incomplete
- Add job-level `timeout-minutes` to all workflows that are missing it
- Add alerting for jobs running beyond 2x expected duration
- If a test was the cause: file a ticket to add a timeout or fix the test
- If a runner was the cause: review runner fleet health monitoring
## Common Mistakes
- Re-running without investigating: Blindly retrying a hung job can exhaust runner capacity and block other teams. Always check the logs before retrying.
- Not distinguishing runner problem from job problem: A runner issue affects all jobs; a job issue affects only that job. Checking runner health first prevents misdiagnosis.
- Not setting job timeouts to prevent future hangs: The most impactful fix is adding `timeout-minutes` to the CI job definition — this caps the blast radius of any future hang.
- Restarting the runner mid-job: Restarting a runner while it is actively running a job causes the job to fail immediately. Cancel the job first, then restart the runner.
- Not checking if the hung step is deterministic: If the same step hangs on every run, the fix must be in the code or configuration — retrying will never succeed.
## Cross-References

- Topic Pack: `training/library/topic-packs/cicd-fundamentals/` (deep background on CI runner architecture)
- Related Runbook: build-failure-triage.md — if the pipeline fails instead of hanging
- Related Runbook: registry-pull-failure.md — if the hang is at the image pull step