
CI/CD Footguns

Mistakes that ship broken code, leak secrets, or make your pipeline a liability instead of a safety net.


1. actions/checkout@main instead of pinning

You reference GitHub Actions by branch. The action maintainer pushes a malicious update to main. Your next CI run executes arbitrary code with your repo's secrets. This has happened to real projects.

Fix: Pin actions by full SHA: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29. At minimum, pin to a tag (actions/checkout@v4) — but remember that tags can be moved to point at new commits; only a SHA is immutable.
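A pinned workflow step looks like this, using the SHA from the fix above (the trailing comment keeps the human-readable version visible; Dependabot can update both together):

```yaml
steps:
  # Full-SHA pin; the comment records the tag it was resolved from
  - uses: actions/checkout@a5ac7e51b41094c92402da3b24376905380afc29  # v4
```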

War story: In March 2025, the widely-used tj-actions/changed-files GitHub Action was compromised. An attacker injected malicious code that exfiltrated CI secrets (AWS keys, NPM tokens, GitHub tokens) from any workflow using the unpinned action. Over 23,000 repositories were exposed. SHA-pinning would have prevented the compromise because the malicious commit had a different hash. This incident drove GitHub to add Dependabot support for action SHA pinning.


2. docker build with no .dockerignore

You build an image from a repo that contains .git, .env, node_modules, and test fixtures. Your image is 2GB and contains your git history (with any secrets ever committed) and your .env with production keys.

Fix: Always create a .dockerignore: .git, .env, node_modules, *.md, test/, Dockerfile.
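A starting .dockerignore along those lines — adjust the list for your stack:

```text
# .dockerignore — keep secrets, history, and build debris out of the image
.git
.env
.env.*
node_modules
test/
*.md
Dockerfile
```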


3. Running tests on main after merge

You merge PRs to main without requiring CI to pass first. A broken commit lands on main. The next 5 PRs are also broken because they're based on a broken main. Now you're debugging 6 problems at once.

Fix: Require status checks on PRs before merge. Enable branch protection. Use merge queues for high-traffic repos.


4. Secrets in CI logs

Your build script echoes environment variables for debugging. Your CI system logs everything. Now your API keys are in build logs that 20 people can read. GitHub Actions redacts the exact value of ${{ secrets.X }} in logs, but derived values slip through: a base64-encoded, concatenated, or otherwise transformed secret prints unmasked.

Fix: Never echo secrets. Use ::add-mask:: in GitHub Actions. Review CI logs periodically for leaked values. Use no_log: true in Ansible CI tasks.
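In GitHub Actions, a value produced at runtime can be registered for masking before it is used, so later log lines redact it. A sketch — the helper script and variable names are illustrative:

```yaml
steps:
  - name: Fetch and mask a runtime token
    run: |
      TOKEN=$(./get-token.sh)          # hypothetical helper that emits a credential
      echo "::add-mask::$TOKEN"        # all later occurrences are redacted in logs
      echo "TOKEN=$TOKEN" >> "$GITHUB_ENV"
```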


5. No rollback plan for deployments

Your CI deploys directly to production. The deploy succeeds (containers start) but the app has a bug. There's no automated rollback. The manual rollback process requires finding the previous image tag in Slack messages.

Fix: Tag every release immutably. Automate rollback: helm rollback or kubectl rollout undo. Health-check the deploy and auto-rollback on failure.
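A deploy step can gate on rollout health and revert automatically. A minimal sketch with kubectl, assuming a Deployment named app and an $IMAGE variable set earlier in the job:

```yaml
steps:
  - name: Deploy and auto-rollback on failed rollout
    run: |
      kubectl set image deployment/app app="$IMAGE"
      # Wait for the rollout; on timeout or failure, revert to the previous revision
      if ! kubectl rollout status deployment/app --timeout=120s; then
        kubectl rollout undo deployment/app
        exit 1
      fi
```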


6. Self-hosted runners without isolation

You use a shared self-hosted CI runner. PR #1 runs npm install which modifies node_modules. PR #2 runs on the same runner and picks up PR #1's modified dependencies. Or worse, a PR from a fork runs malicious code that persists on the runner.

Fix: Use ephemeral runners (new VM per job). Don't run CI from forks without approval. Use container-based runners for isolation.
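Short of fully ephemeral VMs, running each job inside a throwaway container limits what can persist on a self-hosted runner. Image choice is illustrative:

```yaml
jobs:
  test:
    runs-on: [self-hosted, linux]
    container:
      image: node:20   # job runs in a fresh container, not directly on the host
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm test
```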


7. -auto-approve in CI without plan review

Your pipeline runs terraform apply -auto-approve on every merge. A PR with a subtle config change deletes your production database. No one reviewed the plan because CI applies automatically.

Fix: Require terraform plan output in PR comments. Human approval gate before apply. Use CODEOWNERS for infrastructure changes.
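GitHub environments give you a built-in approval gate: the apply job below pauses until a reviewer configured on the production environment approves. Job layout is a sketch; the plan file is passed between jobs as an artifact:

```yaml
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform init && terraform plan -out=tfplan
      - uses: actions/upload-artifact@v4
        with: {name: tfplan, path: tfplan}
  apply:
    needs: plan
    runs-on: ubuntu-latest
    environment: production   # pauses for manual approval when reviewers are configured
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with: {name: tfplan}
      - run: terraform init && terraform apply tfplan
```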


8. Testing against production APIs

Your CI integration tests hit the real Stripe API, the real database, or the real third-party service. A test bug charges a customer's credit card, deletes real data, or exhausts your API rate limit.

Fix: Use test/sandbox environments for external services. Mock external APIs in tests. Use separate credentials with limited permissions for CI.


9. Cache poisoning

You cache ~/.npm or ~/.m2 across builds for speed. An attacker modifies a cached dependency. Every subsequent build uses the poisoned cache. The compromised package ships to production.

Fix: Key caches on lockfile hashes: key: npm-${{ hashFiles('**/package-lock.json') }}. Verify checksums. Clear caches periodically.
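With actions/cache, keying on the lockfile hash means a stale or tampered cache is never restored for a new dependency set:

```yaml
steps:
  - uses: actions/cache@v4
    with:
      path: ~/.npm
      # Key changes whenever the lockfile changes, so each dependency set
      # gets its own cache entry instead of inheriting an old one
      key: npm-${{ hashFiles('**/package-lock.json') }}
```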


10. Deploy on tag push without build verification

You push a git tag and CI deploys from it. But the tag was created from an unverified commit — no tests ran, no scanning happened. The tag just bypassed your entire quality pipeline.

Fix: Tags should only trigger from already-verified commits. Build → test → scan → tag → deploy. Never let a tag trigger a deploy without the full pipeline.
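Even on a tag push, the deploy job can be forced to sit behind the full pipeline with needs:. Job names and make targets are illustrative:

```yaml
on:
  push:
    tags: ['v*']
jobs:
  test:
    runs-on: ubuntu-latest
    steps: [{uses: actions/checkout@v4}, {run: make test}]
  scan:
    runs-on: ubuntu-latest
    steps: [{uses: actions/checkout@v4}, {run: make scan}]
  deploy:
    needs: [test, scan]   # a tag alone cannot reach deploy without passing these
    runs-on: ubuntu-latest
    steps: [{uses: actions/checkout@v4}, {run: make deploy}]
```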


11. Long-lived CI credentials

You created an AWS access key for CI three years ago. It has admin permissions. It's been copied to 4 CI systems, 2 wikis, and a Slack message. No one knows who created it or if it's still needed.

Fix: Use OIDC federation (GitHub → AWS) for keyless auth. If you must use keys, rotate them quarterly. Scope permissions to exactly what CI needs. Audit key usage.
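A sketch of GitHub-to-AWS OIDC federation — the role ARN and bucket are examples; the role's trust policy on the AWS side must allow this repository:

```yaml
permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy   # example role
          aws-region: us-east-1
      # Short-lived credentials are now in the environment; nothing to rotate or leak
      - run: aws s3 sync ./dist s3://example-bucket
```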


12. No step-level timeouts — hung jobs consume runner minutes for hours

A deployment step starts, the remote service hangs without closing the connection, and the pipeline job sits stuck at "deploying" for 6 hours until the global job timeout kills it. Meanwhile, runner slots are consumed and other jobs queue.

Fix: Set explicit timeouts on every step, especially deployments, external API calls, and health checks. In GitHub Actions: timeout-minutes: 10 per job and timeout-minutes: 5 per step for critical steps. For shell commands: timeout 120 ./deploy.sh.
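The three layers of timeout from the fix above, in one job (script name illustrative):

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 10            # hard ceiling for the whole job
    steps:
      - name: Deploy
        timeout-minutes: 5         # tighter limit on the step most likely to hang
        run: timeout 120 ./deploy.sh   # belt and braces: kill the command itself at 2 min
```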


13. Pipeline becomes the production bottleneck

The pipeline runs 45 minutes end-to-end. With 10 engineers committing, the queue grows throughout the day. Developers start batching commits to reduce waits, which makes debugging failures harder. The pipeline is now the team's primary productivity constraint.

Fix: Pipeline time is a team-level SLA. Set a target (10 minutes for PR feedback, 20 minutes end-to-end). Parallelize independent jobs, cache aggressively, skip unchanged components with path filters, split slow test suites.
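Path filters are the cheapest of those wins: a workflow that only touches one component runs only when that component changes. Paths are illustrative:

```yaml
on:
  pull_request:
    paths:
      - 'backend/**'                    # run only when backend files change
      - '.github/workflows/backend.yml' # or when this workflow itself changes
```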


14. Parallel job race conditions — concurrent jobs write to the same resource

Two parallel jobs both write to a shared test database. Sometimes they conflict: one test's setup overwrites another's data, assertions fail randomly, and the failure looks like a flaky test.

Fix: Parallel jobs must not share mutable state. For databases: each job gets its own schema or fresh container. For cloud resources: use unique identifiers in resource names (tests-${CI_JOB_ID}). If you can't fully isolate, serialize the conflicting jobs with needs: dependencies.
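One way to get unique identifiers in GitHub Actions is to build them from the run ID, so parallel and retried runs never collide. Database commands are a sketch:

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    env:
      # Every run (and re-run) gets its own database name
      TEST_DB: test_${{ github.run_id }}_${{ github.run_attempt }}
    steps:
      - run: createdb "$TEST_DB" && ./run-tests.sh && dropdb "$TEST_DB"
```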


15. Self-modifying runners — pipeline modifies the runner environment that persists

A self-hosted runner runs jobs sequentially. Job A installs a global package or modifies /etc/hosts. Job B runs later and depends on the modified environment — sometimes. When Job B runs on a different runner, it fails.

Fix: Treat self-hosted runners as cattle, not pets. Use Docker containers for job isolation. Clean up after each job with a final step that runs under if: always() (or a custom action's post: hook). Never rely on state installed by a previous job. Use ephemeral runners that terminate after each job.
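A cleanup step that always runs, even when the job fails, looks like this — the specific cleanup commands are illustrative and depend on what your jobs touch:

```yaml
steps:
  - run: ./run-job.sh
  - name: Clean up runner state
    if: always()                       # runs even when earlier steps fail
    run: |
      rm -rf "$RUNNER_TEMP"/*          # scratch space
      docker system prune -af || true  # leftover containers and images
```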


16. CD without health checks — deploy "succeeds" but app is broken

Your pipeline deploys automatically after tests pass. The deploy succeeds (container starts, exit code 0). But the application cannot connect to the database and returns 503 on every request. CD marked the deploy as successful because it only checked "did the container start."

Fix: Add readiness and liveness probes. Make your deploy script wait for health checks to pass. Verify the app actually responds after rollout completes.
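Beyond probes, the pipeline itself can refuse to call a deploy done until the app answers. A sketch that polls a health endpoint (URL illustrative) for up to a minute:

```yaml
steps:
  - name: Verify the app responds after rollout
    run: |
      # Poll the health endpoint; fail the deploy if it never passes
      for i in $(seq 1 12); do
        curl -fsS https://app.example.com/healthz && exit 0
        sleep 5
      done
      echo "health check never passed" >&2
      exit 1
```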


17. Deploying on Friday afternoon

You merge a large PR at 4:30 PM Friday. The deploy goes out. At 6 PM, on-call gets paged. The person who wrote the code is offline for the weekend. The incident lasts until Monday.

Fix: Implement deploy freezes for high-risk periods. Automate guards that block production deploys on Fridays after 2 PM, before holidays, and during incidents.
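The Friday guard can be a single early step in the deploy workflow — a sketch, assuming the team's freeze window is expressed in UTC:

```yaml
steps:
  - name: Block Friday-afternoon production deploys
    run: |
      day=$(date -u +%u)    # 1=Mon ... 5=Fri
      hour=$(date -u +%H)
      if [ "$day" -eq 5 ] && [ "$hour" -ge 14 ]; then
        echo "Deploy freeze: no production deploys Friday after 14:00 UTC" >&2
        exit 1
      fi
```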


18. Docker layer cache invalidation — slow builds from bad Dockerfile ordering

Your Dockerfile copies the entire source tree before running pip install. Every commit invalidates the dependency layer, causing a full reinstall on every build. Builds take 8 minutes instead of 2.

Fix: Copy the dependency manifest first, install, then copy source:

# Dependencies change rarely — this layer stays cached across commits
COPY requirements.txt .
RUN pip install -r requirements.txt
# Source changes often — only this layer rebuilds
COPY src/ src/


19. Environment drift between staging and production

Staging uses Postgres 14, prod uses Postgres 16. A query works on one but has a different execution plan on the other. The bug only appears in production.

Fix: Use infrastructure as code with environment-specific overlays, not separate configs. Pin all versions across environments. Automate environment configuration so it cannot drift.


20. Testing only against mocked dependencies

Your CI integration tests mock every external service. Tests pass. In production, the S3 SDK uses a different authentication flow and the third-party API returns a field your mock did not include.

Fix: Use real dependencies in integration tests via Docker Compose or testcontainers. Keep mocks for unit tests only.
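In GitHub Actions, service containers give integration tests a real dependency without leaving CI. A sketch with Postgres — pin the same major version production runs, and treat the test script name as a placeholder:

```yaml
jobs:
  integration:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16   # match the production major version
        env:
          POSTGRES_PASSWORD: test
        ports: ['5432:5432']
        options: >-
          --health-cmd "pg_isready" --health-interval 5s --health-retries 10
    steps:
      - uses: actions/checkout@v4
      # Tests hit a real Postgres, not a mock of one
      - run: DATABASE_URL=postgres://postgres:test@localhost:5432/postgres ./run-integration-tests.sh
```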