
Thinking Out Loud: GitHub Actions

A senior SRE's internal monologue while working through a real GitHub Actions issue. This isn't a tutorial — it's a window into how experienced engineers actually think.

The Situation

The CI pipeline for our main application repository has been failing intermittently for the past week. About 30% of builds fail with different errors each time. The tests pass locally for everyone. The team is losing trust in the CI and starting to merge without waiting for green builds.

The Monologue

Intermittent CI failures with different errors. This smells like a flaky infrastructure problem, not flaky tests. Different errors each time = the tests aren't the problem. Let me look at the recent failures.

gh run list --repo our-org/api-service --status failure --limit 20 --json databaseId,headBranch,conclusion,createdAt,name | jq '.'

Failures across different branches, different workflows. Let me look at the specific error patterns.

for run_id in $(gh run list --repo our-org/api-service --status failure --limit 5 --json databaseId -q '.[].databaseId'); do
  echo "=== Run $run_id ==="
  gh run view $run_id --repo our-org/api-service --log-failed 2>/dev/null | tail -20
  echo
done

I see three different failure patterns:

  1. "Error: The runner has received a shutdown signal" — runner being killed mid-job
  2. "##[error]The operation was canceled" — timeout or cancellation
  3. "Error: Process completed with exit code 137" — OOM killed (128 + 9 = SIGKILL)
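That exit-code arithmetic is easy to reproduce in any POSIX shell, as a quick sanity check unrelated to the pipeline itself:

```shell
# Reproduce exit code 137: SIGKILL a process, which is exactly what the
# kernel OOM killer does. Shells report "killed by signal N" as 128 + N.
sleep 30 &
pid=$!
kill -9 "$pid"
wait "$pid"
echo "exit code: $?"   # prints: exit code: 137
```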

Exit code 137 is the giveaway. The runner is running out of memory. Let me check the workflow definition.

gh api repos/our-org/api-service/contents/.github/workflows/ci.yml --jq '.content' | base64 -d | head -40

runs-on: ubuntu-latest. The standard GitHub-hosted runners have 7GB of RAM. Let me check what the test suite actually needs.

Mental Model: CI Resource Constraints Are Invisible

GitHub Actions runners have fixed resources (2 cores, 7GB RAM for standard). When your test suite grows or dependencies change, you can silently cross the resource boundary. The failures look random because they depend on which tests happen to run first, memory fragmentation, and concurrent processes. The fix is to either optimize resource usage or use larger runners.
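One way to make those constraints visible is a diagnostic step at the top of every job that prints what the runner actually has. A sketch using standard Linux tools (it relies on /proc/meminfo, so it is Linux-only):

```shell
# Print the runner's real resources at job start, so "invisible"
# constraints show up in every log.
echo "CPUs: $(nproc)"
awk '/MemTotal/ {printf "RAM:  %.1f GB\n", $2 / 1024 / 1024}' /proc/meminfo
echo "Disk: $(df -h / | awk 'NR == 2 {print $4}') free on /"
```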

Let me check if the test suite recently changed.

gh api repos/our-org/api-service/commits --per-page 30 | jq '.[] | select(.commit.message | test("test|spec|jest|cypress"; "i")) | {sha: .sha, message: .commit.message, date: .commit.committer.date}' | head -20

Two weeks ago: "feat: add integration tests with Cypress." And one week ago: "feat: add Playwright e2e tests alongside Cypress." They added TWO browser-based test suites. Browser tests are memory hogs — each Chromium instance uses 500MB-1GB.

Let me check the workflow file more carefully.

gh api repos/our-org/api-service/contents/.github/workflows/ci.yml --jq '.content' | base64 -d

The workflow runs: lint, unit tests, Cypress tests, and Playwright tests — all in the same job. That's the problem. The job is spinning up multiple Chromium instances plus the application server plus the test runner, all in 7GB of RAM.

I see the fix: split the tests into separate jobs so they don't compete for memory, and add resource controls.

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --coverage

  cypress-tests:
    runs-on: ubuntu-latest
    needs: [lint, unit-tests]
    steps:
      - uses: actions/checkout@v4
      - uses: cypress-io/github-action@v6
        with:
          start: npm start
          wait-on: 'http://localhost:3000'

  playwright-tests:
    runs-on: ubuntu-latest
    needs: [lint, unit-tests]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test

Key changes:

  • Split into 4 jobs instead of 1
  • Cypress and Playwright run in parallel (both need lint and unit to pass first)
  • Each job gets its own 7GB of RAM
  • Lint and unit tests are gates — if they fail, browser tests don't even start (saves CI minutes)

Mental Model: CI Job Design = Resource Isolation

Each CI job gets its own runner (fresh VM). Splitting workloads into separate jobs provides resource isolation — if one job is memory-hungry, it doesn't affect others. Jobs can run in parallel (faster) or sequentially with needs dependencies (safer). The cost of splitting is slightly more total CI time (setup overhead per job), but the benefit is reliability and clearer failure attribution.

But wait — there's another issue. Even split into separate jobs, a single Playwright suite might still OOM if there are many tests. Let me check.

gh api repos/our-org/api-service/contents/playwright.config.ts --jq '.content' | base64 -d | grep -i "workers\|parallel"

workers: process.env.CI ? 4 : undefined. Four parallel workers in CI, each running Chromium. That's 4 x 500MB = 2GB just for browsers, plus the app server, plus Node.js overhead. On a 7GB runner, that's tight.
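Before hardcoding a worker count, it's worth sanity-checking the budget against the runner's actual memory. A rough heuristic, assuming roughly 1.5GB per Chromium worker and 3GB reserved for the app server, Node, and the OS (both figures are assumptions, not measurements):

```shell
# Derive a conservative Playwright worker count from the runner's memory.
# ASSUMPTIONS: ~1.5GB per Chromium worker (browser + context overhead),
# ~3GB reserved for the app server, Node, and the OS. Tune to taste.
mem_mb=$(awk '/MemTotal/ {print int($2 / 1024)}' /proc/meminfo)
reserved_mb=3000
per_worker_mb=1500
workers=$(( (mem_mb - reserved_mb) / per_worker_mb ))
if [ "$workers" -lt 1 ]; then workers=1; fi
echo "suggested workers: $workers"
```

On a 7GB runner this should land on 2, which matches the manual math above.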

Let me reduce the parallelism.

  playwright-tests:
    runs-on: ubuntu-latest
    needs: [lint, unit-tests]
    env:
      CI: true
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npx playwright test --workers=2

Two workers should keep memory under control. Let me also add a swap file as a safety net — the GitHub runner has disk space we can use.

      - name: Create swap
        run: |
          sudo fallocate -l 4G /swapfile
          sudo chmod 600 /swapfile
          sudo mkswap /swapfile
          sudo swapon /swapfile

That gives us 4GB of emergency swap (the step has to run early in the job, before the tests). The runner won't be fast when swapping, but the job won't get OOM-killed either.
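A follow-up step can confirm the swap actually came up; this reads /proc/meminfo directly, so it works on any Linux runner:

```shell
# Verify swap is live after swapon; SwapTotal stays 0 if the step failed.
awk '/SwapTotal/ {printf "swap total: %d MB\n", $2 / 1024}' /proc/meminfo
```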

Let me also add caching for the npm download cache and the Playwright browser binaries — these are large downloads that don't change often. Caching node_modules directly wouldn't help here, because npm ci deletes node_modules at the start of every run; caching ~/.npm is what actually speeds up the install.

      - uses: actions/cache@v4
        with:
          path: |
            ~/.cache/ms-playwright
            ~/.npm
          key: ${{ runner.os }}-playwright-${{ hashFiles('**/package-lock.json') }}

One more thing — I want to add monitoring for the CI pipeline itself. I'll track success rate and duration.

# Check recent success rate
gh run list --repo our-org/api-service --limit 50 --json conclusion -q '[.[] | .conclusion] | group_by(.) | map({status: .[0], count: length})'
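That groups conclusions by status. To boil it down to a single number, the same shape of data pipes into a percentage; sample JSON stands in for the real gh output here:

```shell
# Success rate as a percentage (sample data in place of
# `gh run list ... --json conclusion` output).
echo '[{"conclusion":"success"},{"conclusion":"failure"},
       {"conclusion":"success"},{"conclusion":"success"}]' |
  jq '([.[] | select(.conclusion == "success")] | length) / length * 100'
# prints 75
```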

Let me push these workflow changes and verify the next few builds pass.

gh pr create --title "fix(ci): split jobs and reduce memory pressure" \
  --body "Splits CI into separate jobs for lint, unit, cypress, and playwright. Reduces playwright workers from 4 to 2. Adds swap file as OOM safety net."

The team should also discuss whether they really need both Cypress AND Playwright. Running two browser test frameworks doubles the CI cost for overlapping coverage.

What Made This Senior-Level

| Junior Would... | Senior Does... | Why |
| --- | --- | --- |
| Add retry: 3 to the failing steps | Investigate the exit codes and identify the resource constraint | Retrying OOM kills just wastes CI minutes and makes failures take 3x longer |
| Not recognize exit code 137 as an OOM kill | Know that exit codes above 128 mean "killed by signal (code - 128)", so 137 = 128 + 9 = SIGKILL | The signal math immediately points to the root cause |
| Try to debug the test code (it "works locally") | Check the CI runner's resource constraints against the workload requirements | "Works locally" with 32GB RAM, fails on CI with 7GB RAM — it's not the code, it's the environment |
| Just request larger runners | Split jobs for resource isolation AND reduce parallelism | Larger runners cost more; proper job design works within existing resources |

Key Heuristics Used

  1. Exit Code 137 = OOM Kill: In CI environments, exit code 137 (128 + signal 9, SIGKILL) almost always means the process was killed for exceeding memory limits. Check this before debugging test logic.
  2. CI Job = Resource Boundary: Each job gets its own runner. Split memory-hungry work into separate jobs for resource isolation.
  3. CI Success Rate Is a Metric: Track CI success rate over time. A drop in success rate erodes team trust and leads to people merging without CI validation.

Cross-References

  • Primer — GitHub Actions architecture, runner types, and workflow syntax
  • Street Ops — Workflow debugging, gh CLI for run inspection, and caching patterns
  • Footguns — Running everything in one job, not caching dependencies, and ignoring CI resource limits