GitHub Actions - Street-Level Ops

What experienced GitHub Actions operators know that tutorials don't teach.

Quick Diagnosis Commands

# List recent workflow runs for a repo
gh run list --repo owner/repo --limit 20

# View a specific run's logs
gh run view <run-id> --log

# Watch a run in progress
gh run watch <run-id>

# Re-run only the failed jobs in a run
gh run rerun <run-id> --failed

# Re-run entire workflow
gh run rerun <run-id>

# List all workflows
gh workflow list --repo owner/repo

# Trigger a workflow manually (workflow_dispatch)
gh workflow run deploy.yml --repo owner/repo -f environment=staging

# View workflow run details (steps, timing)
gh run view <run-id> --repo owner/repo

# Download artifacts from a run
gh run download <run-id> --repo owner/repo --dir ./artifacts

# Check self-hosted runner status
gh api repos/owner/repo/actions/runners | jq '.runners[] | {name, status, busy}'

# List queued/in-progress runs
gh run list --repo owner/repo --status in_progress
gh run list --repo owner/repo --status queued

# Cancel a stuck run
gh run cancel <run-id> --repo owner/repo

Common Scenarios

Scenario 1: Workflow Stuck in Queued State

A workflow has been queued for 10+ minutes without starting.

Diagnosis:

# Check if runners are available and online
gh api repos/owner/repo/actions/runners | jq '.runners[] | {name, status, busy, labels}'

# Check org-level runners too (if applicable)
gh api orgs/myorg/actions/runners | jq '.runners[] | {name, status, busy}'

# Look at the job's runner requirements in the YAML
# The 'runs-on' label must match an available runner

# For GitHub-hosted: check GitHub status page
# https://www.githubstatus.com/

Common causes and fixes:

1. No matching runner label:
   - Job requires 'self-hosted, linux, gpu' but only 'self-hosted, linux' is registered
   - Fix: update runner labels or change the 'runs-on' value in the workflow

2. All self-hosted runners are busy or offline:
   - Scale up runner pool or wait
   - Check runner machine is up: ssh to runner host, check runner service
     sudo systemctl status 'actions.runner.*.service'
     sudo journalctl -u 'actions.runner.*.service' -n 50

3. Concurrency group blocking:
   - Another run holds the concurrency lock
   - gh run list --repo owner/repo --status in_progress
   - Cancel the blocking run or wait

4. GitHub-hosted runner availability (Actions outage):
   - Check https://www.githubstatus.com/
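
Cause 1 comes down to a subset check: a job is only picked up by a runner whose label set contains every label listed in runs-on. A minimal sketch of that check in POSIX shell, assuming both label sets are passed as comma-separated strings. The function name runner_matches is hypothetical, and the exact-case comparison is a simplification:

```shell
# runner_matches "<required labels>" "<runner labels>"
# Both arguments are comma-separated label lists.
# Returns 0 when every required label is present on the runner, 1 otherwise.
runner_matches() {
  required=$1
  # Strip spaces and wrap in commas so each label can be matched as ",label,"
  available=",$(printf '%s' "$2" | tr -d ' '),"
  old_ifs=$IFS
  IFS=','
  for label in $required; do
    label=$(printf '%s' "$label" | tr -d ' ')
    case "$available" in
      *",$label,"*) ;;                  # label present, keep checking
      *) IFS=$old_ifs; return 1 ;;      # label missing, runner cannot take the job
    esac
  done
  IFS=$old_ifs
  return 0
}
```

Example: runner_matches "self-hosted, linux, gpu" "self-hosted, linux" returns 1, which is the situation that leaves a job queued. The runner side can be fed from the API, for instance with gh api repos/owner/repo/actions/runners --jq '.runners[] | [.labels[].name] | join(",")'.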

Under the hood: GitHub-hosted runners are ephemeral VMs -- each job gets a fresh VM that is destroyed after the job completes. Self-hosted runners persist by default, which means file system state, Docker images, and tool versions accumulate between jobs. This is both a feature (faster warm cache) and a trap (dependency contamination between unrelated workflows).

Scenario 2: Secret Not Available in Workflow

A step fails because an env var is empty, even though you set the secret in the repo settings.

Diagnosis:

# Confirm the secret name matches exactly (case-sensitive)
gh secret list --repo owner/repo

# For org secrets, check if repo has access
gh api orgs/myorg/actions/secrets/MY_SECRET/repositories \
  | jq '.repositories[].name'

# Add debug step to list which vars are set (names only, never values)
- name: Debug env
  run: env | cut -d= -f1 | grep -E '^(INPUT_|RUNNER_|GITHUB_)' | sort

Fix:

# Secrets are not exposed to steps automatically; map each one into env explicitly
steps:
  - run: deploy.sh
    env:
      API_KEY: ${{ secrets.API_KEY }}   # explicit mapping

# Environment secrets require the job to target the environment
jobs:
  deploy:
    environment: production             # gate to access environment secrets
    steps:
      - run: deploy.sh
        env:
          PROD_KEY: ${{ secrets.PROD_KEY }}
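
To confirm that a mapping worked without leaking anything into the log, a step can report only the names of variables that are unset or empty. A minimal sketch in POSIX shell; missing_env is a hypothetical helper, not part of any GitHub tooling:

```shell
# missing_env NAME...
# Prints the name of every listed variable that is unset or empty.
# Values are never printed, so the output is safe for a job log.
# Returns 1 if anything is missing or a name is invalid, 0 otherwise.
missing_env() {
  status=0
  for name in "$@"; do
    # Accept only valid identifiers before handing the name to eval
    case "$name" in
      ''|[!A-Za-z_]*|*[!A-Za-z0-9_]*)
        echo "invalid name: $name" >&2
        status=1
        continue
        ;;
    esac
    eval "value=\${$name-}"   # indirect lookup of the variable called $name
    if [ -z "$value" ]; then
      echo "$name"
      status=1
    fi
  done
  unset value                 # do not keep a secret in a scratch variable
  return "$status"
}
```

Called as missing_env API_KEY PROD_KEY at the top of a run: block, it fails the step early and names the variable that did not arrive.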

OIDC for cloud auth (preferred over long-lived secrets):

permissions:
  id-token: write
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions-role
      aws-region: us-east-1
  - run: aws s3 ls

Scenario 3: Cache Miss Every Run

You set up actions/cache but cache hit rate is 0%.

Diagnosis:

# Check cache key — if it includes content that changes every run, you'll always miss
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-npm-${{ hashFiles('**/package-lock.json') }}
    # restore-keys provides fallback on partial match
    restore-keys: |
      ${{ runner.os }}-npm-

Common causes:

1. Key includes ${{ github.sha }} — changes every commit, always a miss
   Fix: key on lockfile hash, not commit SHA

2. Different runner OS (ubuntu-22 vs ubuntu-24) — cache is OS-keyed
   Fix: pin runner version: ubuntu-22.04 not ubuntu-latest

3. Paths don't match — caching ~/.npm but restoring to ~/different/path
   Fix: verify 'path' matches where the tool actually writes

4. Cache eviction — GitHub evicts caches not accessed in 7 days, or >10GB total
   Fix: nothing to do; the cache is rebuilt and saved again on the next run

5. Branch-scoped caches — caches created on feature branches don't restore on main
   Fix: make sure main saves its own cache; feature branches can then fall back to it via restore-keys

Default trap: GitHub Actions caches are scoped to the branch where they were created. A cache saved on a feature branch is not available to main, but a cache saved on main (the default branch) can be restored by feature branches, either by exact key or via restore-keys fallback. As a result, the first run on main after a lockfile change or a long quiet period always misses. Build the cache on main with a scheduled workflow to keep it warm.
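
The lookup order that restore-keys relies on can be sketched in a few lines: exact match on key first, then each restore-keys prefix in order, taking the newest cache that starts with it. This is a simplified model in POSIX shell. It ignores branch scoping, the function name pick_cache is hypothetical, and the list of available keys is assumed to be ordered newest first:

```shell
# pick_cache KEY "RESTORE_PREFIXES" "AVAILABLE_KEYS"
# RESTORE_PREFIXES and AVAILABLE_KEYS are newline-separated lists;
# AVAILABLE_KEYS is assumed to be ordered newest first.
# Prints "hit <key>", "partial <key>", or "miss".
pick_cache() {
  key=$1
  prefixes=$2
  available=$3
  # 1. Exact match on the primary key
  if printf '%s\n' "$available" | grep -Fxq -- "$key"; then
    echo "hit $key"
    return 0
  fi
  # 2. restore-keys in order; first (newest) cache whose key starts with the prefix
  old_ifs=$IFS
  IFS='
'
  for prefix in $prefixes; do
    for candidate in $available; do
      case "$candidate" in
        "$prefix"*) IFS=$old_ifs; echo "partial $candidate"; return 0 ;;
      esac
    done
  done
  IFS=$old_ifs
  echo "miss"
  return 1
}
```

With a key built from github.sha and no restore-keys, step 1 never matches and step 2 has nothing to try, which is why cause 1 above yields a 0% hit rate.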

Scenario 4: Matrix Build Partial Failure

A matrix strategy has 10 jobs; 2 fail, 8 pass. You want to rerun only the failures.

# Rerun only failed jobs
gh run rerun <run-id> --failed

# If you need to debug one matrix leg interactively, use tmate
- name: Debug via tmate
  if: failure()
  uses: mxschmitt/action-tmate@v3
  with:
    limit-access-to-actor: true

# To continue other matrix jobs when one fails (fail-fast: false)
strategy:
  fail-fast: false
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
    node: [18, 20, 22]

Key Patterns

Workflow Trigger Best Practices

on:
  push:
    branches: [main]
    paths:
      - 'src/**'
      - 'package*.json'         # skip CI when only docs change
  pull_request:
    branches: [main]
    types: [opened, synchronize, reopened]
  workflow_dispatch:            # manual trigger with optional inputs
    inputs:
      environment:
        description: 'Target environment'
        required: true
        default: 'staging'
        type: choice
        options: [staging, production]
  schedule:
    - cron: '0 6 * * 1'        # Monday 6 AM UTC for weekly jobs

Concurrency Control

# Cancel in-progress runs for the same PR/branch (safe for CI, dangerous for deploy)
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

# For deployments: queue instead of cancel
concurrency:
  group: deploy-${{ inputs.environment }}
  cancel-in-progress: false    # wait for current deploy to finish

Reusable Workflows

# Caller workflow
jobs:
  call-deploy:
    needs: build                 # required to read needs.build.outputs.version
    uses: myorg/workflows/.github/workflows/deploy.yml@main
    with:
      environment: production
      version: ${{ needs.build.outputs.version }}
    secrets: inherit             # or explicitly: secrets: { TOKEN: ${{ secrets.TOKEN }} }

# Reusable workflow definition
on:
  workflow_call:
    inputs:
      environment:
        required: true
        type: string
      version:                   # must be declared because the caller passes it
        required: true
        type: string
    secrets:
      TOKEN:
        required: true

Artifact Upload/Download

# Upload build artifacts
- uses: actions/upload-artifact@v4
  with:
    name: dist-${{ github.sha }}
    path: dist/
    retention-days: 7

# Download in a subsequent job
jobs:
  build:
    outputs:
      artifact-name: dist-${{ github.sha }}
    steps:
      - uses: actions/upload-artifact@v4
        with:
          name: dist-${{ github.sha }}
          path: dist/

  deploy:
    needs: build
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: dist-${{ github.sha }}
          path: ./dist

Environment Protection Rules

# Workflow targets environment — triggers required reviewers, wait timers
jobs:
  deploy-prod:
    environment:
      name: production
      url: https://myapp.example.com
    steps:
      - run: ./deploy.sh production

Configure in GitHub: Settings → Environments → production → Required reviewers, wait timer (max 30 days), deployment branch policy.

Self-Hosted Runner Registration

# Register a new runner (runner machine)
mkdir -p ~/actions-runner && cd ~/actions-runner
curl -o actions-runner-linux-x64.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.317.0/actions-runner-linux-x64-2.317.0.tar.gz
tar xzf ./actions-runner-linux-x64.tar.gz
./config.sh --url https://github.com/owner/repo --token <TOKEN>

# Install as service
sudo ./svc.sh install
sudo ./svc.sh start
sudo systemctl status 'actions.runner.*.service'

# Runner logs
sudo journalctl -u 'actions.runner.*.service' -f

# Remove a runner
./config.sh remove --token <TOKEN>
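
The <TOKEN> values above are short-lived tokens that expire after one hour. They are usually copied from the repository's runner settings page, but they can also be requested through the REST API, which helps when runners are provisioned by script. A sketch, assuming gh is authenticated with admin access to the repository; runner_token is a hypothetical wrapper name, while the two endpoints are GitHub's:

```shell
# runner_token OWNER/REPO KIND
# KIND is "registration" (for config.sh --url ... --token)
# or "remove" (for config.sh remove --token).
# Prints the token on stdout.
runner_token() {
  gh api --method POST "repos/$1/actions/runners/$2-token" --jq '.token'
}
```

Usage: ./config.sh --url https://github.com/owner/repo --token "$(runner_token owner/repo registration)". For removal, pass remove instead of registration.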

Rate Limits and API Throttling

# Check your current rate limit status
gh api rate_limit | jq '.rate'

# GitHub Actions limits:
# - 1,000 API requests per hour per repo (GITHUB_TOKEN)
# - Concurrent jobs on GitHub-hosted runners depend on plan (20 on Free, up to 500 on Enterprise)
# - 256 jobs per workflow run (matrix)
# - 6 hours max job runtime (GitHub-hosted)
# - 35 days max workflow run duration
# - 90 days default artifact and log retention

# If hitting rate limits in workflows:
# - Cache aggressively (hashFiles on lockfiles)
# - Use github.token for API calls: its quota (1,000/hour per repo) is separate from a PAT's per-user quota (5,000/hour)
# - Batch API calls where possible
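
When a script does hit the limit, the reset field of the rate_limit response (a Unix timestamp) says how long to back off. A minimal POSIX shell sketch of that calculation; seconds_until_reset is a hypothetical helper, and the optional second argument exists only to make the arithmetic reproducible:

```shell
# seconds_until_reset RESET_EPOCH [NOW_EPOCH]
# Prints how many seconds remain until the rate limit window resets (never negative).
seconds_until_reset() {
  reset=$1
  now=${2:-$(date +%s)}        # default to the current time
  delay=$((reset - now))
  [ "$delay" -gt 0 ] || delay=0
  echo "$delay"
}
```

Usage: sleep "$(seconds_until_reset "$(gh api rate_limit --jq '.rate.reset')")" before retrying a batch of calls.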

Debugging with act (local runner)

# Install act
brew install act  # macOS
# or: curl https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash

# List available jobs
act -l

# Run a specific job locally
act push -j build

# With secrets file
act push -j build --secret-file .secrets

# Use a specific runner image (default is micro, use medium for more tools)
act push --platform ubuntu-latest=ghcr.io/catthehacker/ubuntu:act-latest