CI/CD Pipelines - Street Ops¶
What experienced pipeline engineers know that documentation doesn't teach.
Incident Runbooks¶
Pipeline Failing Intermittently (Flaky)¶
1. Identify the pattern:
- Is it the same test failing? -> Flaky test
- Is it random steps timing out? -> Resource/runner issue
- Does it fail only at certain times? -> External dependency
2. Flaky test triage:
- Quarantine the test: move to a separate job that's allowed to fail
- Track it: create an issue, label it "flaky"
- Common causes:
* Race conditions (async code, parallel test execution)
* Time-dependent assertions (frozen dates, timezone issues)
* External API calls (mock them instead)
* Shared state between tests (database not cleaned up)
- Fix: add retry logic as a temporary band-aid, then actually fix the test.
  GitHub Actions has no built-in step retry; a wrapper action such as nick-fields/retry provides it:
  - uses: nick-fields/retry@v3
    with:
      timeout_minutes: 10
      max_attempts: 3
      retry_on: error
      command: npm test
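The quarantine step above can be sketched as a separate GitHub Actions job with continue-on-error; the job names and jest-style test-selection flags are illustrative:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Stable suite: must pass for the pipeline to go green
      - run: npx jest --testPathIgnorePatterns='tests/flaky/'
  test-flaky:
    runs-on: ubuntu-latest
    continue-on-error: true   # quarantined: failures are visible but non-blocking
    steps:
      - uses: actions/checkout@v4
      # Quarantined tests, each tracked in an issue labeled "flaky"
      - run: npx jest tests/flaky/
```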
3. Runner resource issues:
- GitHub-hosted runners have fixed specs (ubuntu-latest: 4 CPU / 16GB RAM for public repos; 2 CPU / 7GB for private repos on the free plan)
- Heavy builds may OOM -> split into smaller jobs or use larger runners
- Disk space: runners start with ~14GB free. Large builds/caches can fill it.
Run df -h at the start of your pipeline to monitor free space
4. External dependency failures:
- npm registry down, Docker Hub rate limits, apt mirrors slow
- Use caching aggressively to avoid re-downloading
- Mirror critical dependencies
- Set timeouts so failures are fast, not 6-hour hangs
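The timeout advice can be expressed directly in GitHub Actions, at both the job and the step level:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 15      # cap the whole job (the default is 360 minutes)
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
        timeout-minutes: 5   # fail fast if the registry is down or slow
```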
Pipeline Takes Too Long¶
1. Measure first:
- GitHub Actions: check the workflow run, expand each step timing
- Find the bottleneck: is it checkout? Dependencies? Tests? Build? Deploy?
2. Common speed improvements:
Caching (biggest win usually):
- Cache node_modules, pip packages, Go modules, Maven repo
- Cache Docker layers: use buildx with cache-to/cache-from
- Cache key must include lockfile hash
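A minimal actions/cache step keyed on the lockfile hash; the npm cache directory is one example, swap in pip/Go/Maven paths as needed:

```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    # Exact hit when the lockfile is unchanged...
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    # ...otherwise fall back to the newest cache for this OS
    restore-keys: |
      npm-${{ runner.os }}-
```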
Parallelization:
- Split tests into parallel jobs (test sharding)
- Run lint, type-check, unit tests in parallel (no dependencies between them)
- Use matrix builds for multi-platform testing
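Test sharding can be sketched with a matrix; the --shard flag here is Jest's, and other runners (pytest-split, Playwright, etc.) have equivalents:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]   # four parallel jobs, each running a quarter of the suite
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4
```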
Skip unnecessary work:
- Path filters: only run frontend tests when frontend code changes
on:
  push:
    paths: ['frontend/**']
- Use conditional steps:
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
Docker build optimization:
- Multi-stage builds (smaller images, faster push/pull)
- Order Dockerfile layers: dependencies before source code
- Use buildx cache: --cache-from type=gha --cache-to type=gha
Checkout optimization:
- Shallow clone: actions/checkout with fetch-depth: 1
- Sparse checkout: only the files you need
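Both checkout optimizations as actions/checkout inputs (the sparse paths are illustrative):

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 1        # shallow clone: latest commit only, no history
    sparse-checkout: |    # materialize only these paths
      frontend
      package.json
```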
3. Target: CI should complete in under 10 minutes. CD (deploy) is separate.
Secrets Management in Pipelines¶
1. Types of secrets in CI/CD:
- Cloud credentials (AWS, GCP, Azure)
- Container registry tokens
- SSH keys for deployment
- API tokens for third-party services
- Database passwords for integration tests
2. Best practices:
- OIDC for cloud providers (no static credentials):
# GitHub Actions OIDC with AWS
permissions:
  id-token: write
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789:role/github-actions
      aws-region: us-east-1
# No AWS keys stored in GitHub at all
- Scope secrets to environments (staging secrets != production secrets)
- Use repository-level secrets for shared values, environment-level for deploy targets
- Never echo/print secret values in debug logs
3. Secret rotation:
- Rotate secrets on a schedule (quarterly minimum)
- Update the secret in GitHub Settings -> Secrets
- Verify the pipeline still works after rotation
- Automate rotation where possible (AWS STS short-lived credentials, HashiCorp Vault dynamic secrets)
4. When a secret leaks:
- Rotate the secret IMMEDIATELY (don't wait for the PR to merge)
- Check GitHub audit log for who accessed it
- Review pipeline logs for exposure
- Check if secret was cached in artifacts or Docker layers
Self-Hosted Runner Problems¶
1. Runner offline:
- Check the runner machine: is it up? Is the service running?
systemctl status actions.runner.*
- Network: can it reach github.com?
- Registration: re-register if the token expired
2. Runner disk full:
- Docker images accumulate: docker system prune -af
- Work directories pile up: clean _work/ directories older than 7 days
- Cron job for cleanup:
0 3 * * * docker system prune -af && find /runner/_work -mindepth 1 -maxdepth 1 -mtime +7 -exec rm -rf {} \;
3. Dependency contamination:
- Unlike GitHub-hosted runners, self-hosted runners persist between jobs
- One job installs Node 18, next job expects Node 20 but gets 18
- Fix: use Docker containers for each job, or use tool version managers
- Or: run ephemeral self-hosted runners (spin up, run job, destroy)
4. Security:
- NEVER use self-hosted runners for public repos
(any PR can execute arbitrary code on your runner)
- Isolate runners: dedicated VMs, not shared infrastructure
- Use runner groups to control which repos can use which runners
Debugging a Failed Deploy Step¶
1. Read the error message completely (people skip this surprisingly often)
2. Reproduce locally:
- What command is the step running?
- Can you run it locally with the same inputs?
- Are environment variables set correctly?
3. Common deploy failures:
- Authentication expired:
* OIDC token has a short TTL
* AWS STS session expired during long deploy
* Container registry token expired
- Resource already exists / state mismatch:
* Terraform: state doesn't match reality
* Kubernetes: resource was modified manually
* Fix: import/reconcile before retrying
- Timeout:
* Health check takes too long
* Database migration blocking
* DNS propagation delay
* Increase timeout, add retry logic
- Permission denied:
* IAM role missing a required permission
* Kubernetes RBAC insufficient
* File permission on deployed artifact
4. Add better error handling:
- Use set -euo pipefail in bash steps
- Add explicit error messages before commands that commonly fail
- Use step-level continue-on-error: false (default) to fail fast
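A minimal sketch of that advice in a bash deploy step; require_file is a hypothetical helper, not part of any tool:

```shell
#!/bin/bash
# Fail on unset variables, failed commands, and mid-pipeline errors.
set -euo pipefail

# require_file: fail fast with a descriptive message instead of letting
# a later command die with a cryptic error.
require_file() {
  if [ ! -f "$1" ]; then
    echo "ERROR: required file missing: $1 (did the build step run?)" >&2
    return 1
  fi
}

ARTIFACT="$(mktemp)"   # stand-in for a real build artifact
require_file "$ARTIFACT" && echo "preflight ok"

# A missing file is reported with context; the || keeps this demo
# running under set -e.
require_file "/nonexistent/artifact.tar.gz" 2>/dev/null || echo "caught missing artifact"
```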
Gotchas & War Stories¶
GitHub Actions expression gotchas
${{ secrets.MY_SECRET }} evaluates to an empty string if the secret does not exist -- not an error. Your deploy script silently runs with no credentials and may succeed in unexpected ways (e.g., deploying to a default environment instead of the intended one). Always validate required secrets at the start of your pipeline with an explicit check.
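That explicit check is one line of bash using ${VAR:?} expansion; DEPLOY_TOKEN is a hypothetical secret name passed in via the workflow's env block:

```shell
#!/bin/bash
set -euo pipefail

# ${VAR:?message} aborts with the message when VAR is unset or empty,
# turning a silently-empty secret into a hard, early failure.

DEPLOY_TOKEN=""   # simulating a secret that was never configured
if ( : "${DEPLOY_TOKEN:?DEPLOY_TOKEN secret is missing or empty}" ) 2>/dev/null; then
  echo "unreachable"
else
  echo "empty secret caught"
fi

DEPLOY_TOKEN="abc123"   # simulating a configured secret
: "${DEPLOY_TOKEN:?}"
echo "secret present"
```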
Docker Hub rate limits Public Docker Hub pulls are rate-limited (100 pulls/6 hours for unauthenticated). CI pipelines hit this fast. Fix: authenticate to Docker Hub, mirror images to your own registry, or use GitHub Container Registry.
The "works on my machine" gap Your local Docker build uses cached layers. CI builds from scratch. Different base image versions, different network conditions. Pin your base images and use lockfiles.
Branch protection bypass
If your pipeline uses GITHUB_TOKEN to push commits (like auto-formatting), those commits bypass branch protection rules. Use a dedicated bot token or GitHub App for operations that should trigger CI.
Concurrency control Two pushes to main in quick succession = two deploys racing. Use concurrency groups:
concurrency:
  group: deploy-${{ github.ref }}
  cancel-in-progress: false  # For deploys, wait don't cancel
The matrix explosion Matrix builds are powerful but can overwhelm your runner capacity. 3 OS versions x 4 language versions x 3 database versions = 36 jobs. Be intentional about what you actually need to test.
Scale note: GitHub Actions limits concurrent jobs to 20 per repository (free tier) or 180-500 per org (paid tiers). A 36-job matrix on a busy repo can queue behind other workflows for 30+ minutes. Use matrix.include to test only the combinations that matter instead of the full Cartesian product.
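With matrix.include you list explicit combinations rather than generating a cross-product; the versions here are illustrative:

```yaml
strategy:
  matrix:
    include:   # 3 jobs instead of a 36-job Cartesian product
      - os: ubuntu-latest
        node: 20
        db: postgres-16
      - os: ubuntu-latest
        node: 18
        db: postgres-14    # oldest still-supported combination
      - os: macos-latest
        node: 20
        db: postgres-16
```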
Task: Implement a Canary Deploy¶
Ship v2 to 5% of traffic, watch metrics, then roll forward or back.
# Using Kubernetes and an Ingress controller (nginx)
# Step 1: Deploy canary alongside stable
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1  # 1 canary pod
  selector:
    matchLabels:
      app: myapp
      track: canary
  template:
    metadata:
      labels:
        app: myapp
        track: canary
    spec:
      containers:
        - name: myapp
          image: registry.example.com/myapp:v2.0.0
---
# Step 2: Configure traffic split via Ingress annotation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"  # 5% of traffic
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-canary
                port:
                  number: 8080
# Step 3: Monitor canary metrics — compare error rate to stable
# Step 4: If canary looks good, increase traffic
$ kubectl annotate ingress myapp-canary \
nginx.ingress.kubernetes.io/canary-weight="25" --overwrite
# Step 5: If canary is bad, kill it
$ kubectl delete deployment myapp-canary
$ kubectl delete ingress myapp-canary
Task: Automated Rollback on Metrics Regression¶
#!/bin/bash
# deploy-with-rollback.sh
set -euo pipefail
IMAGE="registry.example.com/myapp:${1:?Usage: deploy.sh <tag>}"
DEPLOY="myapp"
NAMESPACE="production"
ERROR_THRESHOLD="0.05" # 5% error rate
WATCH_DURATION=300 # 5 minutes
echo "Deploying $IMAGE..."
kubectl set image deployment/$DEPLOY myapp=$IMAGE -n $NAMESPACE
kubectl rollout status deployment/$DEPLOY -n $NAMESPACE --timeout=120s
echo "Monitoring for ${WATCH_DURATION}s..."
START=$(date +%s)
while [ $(($(date +%s) - START)) -lt $WATCH_DURATION ]; do
  ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=rate(http_requests_total{app=\"myapp\",code=~\"5..\"}[2m]) / rate(http_requests_total{app=\"myapp\"}[2m])" \
    | jq -r '.data.result[0].value[1] // "0"')
  if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
    echo "ERROR RATE ${ERROR_RATE} exceeds threshold. Rolling back..."
    kubectl rollout undo deployment/$DEPLOY -n $NAMESPACE
    exit 1
  fi
  sleep 15
done
echo "Deploy successful. Error rate stable."
Task: Handle Database Migrations in CI/CD¶
Database changes must be backward-compatible. Use the expand-and-contract pattern:
-- Step 1 (Deploy N): add the new column, keep the old columns
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- App writes to BOTH old and new columns
-- Step 2 (Deploy N+1): backfill data
UPDATE users SET full_name = first_name || ' ' || last_name WHERE full_name IS NULL;
-- Step 3 (Deploy N+2): app reads from the new column only
-- Step 4 (Deploy N+3): drop the old columns
ALTER TABLE users DROP COLUMN first_name, DROP COLUMN last_name;
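In the pipeline, each expand/contract step ships as its own deploy, with the migration gated before the app rollout. A sketch with hypothetical script names:

```yaml
jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-migrations.sh   # applies the expand migration
  deploy:
    needs: migrate   # the app rollout only starts after migrations succeed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/deploy.sh
```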
Task: GitOps with ArgoCD Pattern¶
Declarative deployments where Git is the source of truth:
# CI pipeline updates the image tag in the config repo
$ cd infra-config
$ kustomize edit set image myapp=registry.example.com/myapp:abc123
$ git add . && git commit -m "deploy: myapp abc123"
$ git push
# ArgoCD detects the change and syncs the cluster
Task: Multi-Stage Docker Builds in CI¶
FROM python:3.11 AS builder
WORKDIR /app
COPY requirements.txt .
# Install into an isolated prefix so the packages can be copied cleanly
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt
FROM python:3.11-slim
WORKDIR /app
# /usr/local is readable by any user, so the non-root USER below can
# import the packages (pip install --user into /root/.local is not)
COPY --from=builder /install /usr/local
COPY src/ src/
USER 1000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
Quick Reference¶
- Cheatsheet: CI/CD
- Deep Dive: CI/CD Pipeline Architecture