---
title: "GitHub Actions: CI/CD That Lives in Your Repo"
tags:
  - lesson
  - github-actions
  - ci/cd-pipelines
  - docker
  - oidc
  - supply-chain-security
  - caching
  - secrets-management
---

# GitHub Actions — CI/CD That Lives in Your Repo

Topics: GitHub Actions, CI/CD pipelines, Docker, OIDC, supply chain security, caching, secrets management
Level: L1–L2 (Foundations → Operations)
Time: 75–90 minutes
Prerequisites: None (Git basics help, but everything is explained)
The Mission¶
Your team just moved a Python API from a manually-deployed VM to containers. There's no CI.
Developers push to main, SSH into the server, run docker build, and pray. Last Tuesday
someone deployed without running tests and broke authentication for 4,000 users.
Your job: build a CI/CD pipeline from scratch in GitHub Actions. By the end of this lesson you'll have a workflow that lints, tests, builds a Docker image, and deploys to staging and production — with no static credentials anywhere, matrix testing across Python versions, and security hardening that would survive a hostile fork.
We're going to build it piece by piece. Each section adds one capability to the pipeline, so you see why each feature exists before you use it.
Part 1: Anatomy of a Workflow File¶
Everything in GitHub Actions starts with a YAML file in .github/workflows/. That's it.
No external server, no Jenkins install, no webhook configuration. The CI/CD lives in the
same repo as the code.
# .github/workflows/ci.yml
name: CI Pipeline
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: echo "Hello from CI"
That's a working pipeline. Push it and GitHub runs it. Let's break down every piece:
| Line | What it does |
|---|---|
| `name: CI Pipeline` | Display name in the Actions tab — cosmetic but helpful |
| `on: push` | Trigger: run this workflow when code is pushed |
| `branches: [main]` | Only for pushes to main (skip feature branches here) |
| `on: pull_request` | Also run on PRs targeting main — your gate before merge |
| `jobs:` | Container for all jobs; each job gets its own fresh VM |
| `runs-on: ubuntu-latest` | The runner — a GitHub-hosted 2-core, 7 GB RAM Azure VM |
| `uses: actions/checkout@v4` | Clone your repo into the runner (the most-used action in existence) |
| `run:` | Execute a shell command |
Trivia: The first version of GitHub Actions (2018) used HCL, not YAML. GitHub switched to YAML for the public launch in 2019, citing broader developer familiarity. The decision was controversial — many developers found YAML verbose for complex logic — but it won because nearly everyone already knew YAML from Docker Compose, Kubernetes manifests, and Ansible playbooks.
Under the Hood: GitHub-hosted runners are ephemeral Azure VMs. Each job gets a fresh VM that's destroyed after the job completes. Nothing persists between jobs — no files, no Docker images, no installed packages. This is a feature: it prevents one workflow from contaminating another. It's also why caching matters (we'll get there).
The trigger zoo¶
GitHub Actions has more trigger types than most people realize:
on:
push: # code pushed
branches: [main]
paths: ['src/**', 'tests/**'] # only when these files change
pull_request: # PR opened, updated, or reopened
types: [opened, synchronize, reopened]
schedule:
- cron: '0 6 * * 1' # Monday 6am UTC (cron syntax)
workflow_dispatch: # manual "Run workflow" button
inputs:
environment:
type: choice
options: [staging, production]
workflow_call: # called by another workflow (reusable)
release:
types: [published] # when a GitHub Release is created
The paths filter is an underrated time-saver. If you only changed a README, why rebuild
a Docker image? Skip CI on doc-only changes and save minutes (which are literally money on
private repos).
Part 2: Build the Real Pipeline — Lint, Test, Build¶
Time to replace that echo with something useful. Here's a pipeline for a Python API:
name: CI Pipeline
on:
push:
branches: [main]
paths:
- 'app/**'
- 'tests/**'
- 'requirements*.txt'
- 'Dockerfile'
- '.github/workflows/ci.yml'
pull_request:
branches: [main]
permissions:
contents: read # principle of least privilege — more on this later
jobs:
lint:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
cache: 'pip' # built-in pip cache — no actions/cache needed
- name: Install linter
run: pip install ruff
- name: Lint
run: ruff check app/ tests/
test:
runs-on: ubuntu-latest
timeout-minutes: 10
needs: lint # only run if lint passes
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
cache: 'pip'
- name: Install dependencies
run: pip install -r requirements.txt -r requirements-test.txt
- name: Run tests with coverage
run: pytest --cov=app --cov-fail-under=90 -v
- name: Upload coverage report
uses: actions/upload-artifact@v4
with:
name: coverage-report
path: htmlcov/
retention-days: 7
A few things worth noticing:
timeout-minutes — always set this. The default is 360 minutes (6 hours). If your
test suite hangs, you don't want to find out when you get the bill.
needs: lint — creates a dependency chain. test won't start until lint passes.
If lint fails, test is skipped. This saves minutes and makes failures faster to diagnose.
cache: 'pip' — the setup-python action has built-in caching. One line replaces
what used to be 10 lines of actions/cache configuration. Most setup-* actions now have
this feature (Node, Python, Go, Java).
upload-artifact — saves the coverage report as a downloadable file attached to the
workflow run. Artifacts are how you pass data between jobs (each job runs on a separate VM,
remember).
Gotcha: Artifacts are not caches. Caches persist across workflow runs (for the same branch) and are used to speed up dependency installation. Artifacts are output from a specific run — test reports, built binaries, Docker images — and expire after a configurable retention period. Mixing them up leads to either missing data or bloated cache storage.
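To make the distinction concrete, here is a minimal sketch of both in one job (the cache key and paths are illustrative, not prescriptive):

```yaml
steps:
  # Cache: keyed on the lockfile hash, restored by future runs on this branch
  - uses: actions/cache@v4
    with:
      path: ~/.cache/pip
      key: pip-${{ runner.os }}-${{ hashFiles('**/requirements*.txt') }}

  # Artifact: output of THIS run, downloadable from the run's summary page
  - uses: actions/upload-artifact@v4
    with:
      name: coverage-report
      path: htmlcov/
      retention-days: 7
```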
Flashcard check¶
| Question | Answer |
|---|---|
| What is the default timeout for a GitHub Actions job? | 360 minutes (6 hours) |
| What does needs: lint do? | Creates a dependency — the job only runs if lint succeeds |
| Where do workflow files live? | .github/workflows/*.yml in the repository |
| What's the difference between a cache and an artifact? | Cache persists across runs (speeds up installs). Artifact is output from one run (reports, binaries). |
Part 3: Matrix Builds — Test Everything at Once¶
Your API needs to work on Python 3.10, 3.11, and 3.12. You could copy-paste the test job three times, or you could use a matrix:
test:
runs-on: ubuntu-latest
timeout-minutes: 10
needs: lint
strategy:
fail-fast: false # don't cancel other versions if one fails
matrix:
python-version: ['3.10', '3.11', '3.12']
include:
- python-version: '3.12'
coverage: true # only measure coverage on latest version
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
- name: Install dependencies
run: pip install -r requirements.txt -r requirements-test.txt
- name: Run tests
run: pytest -v
- name: Run tests with coverage
if: matrix.coverage
run: pytest --cov=app --cov-fail-under=90 --cov-report=xml
Three lines of YAML create three parallel jobs. Add an os dimension and you're testing
across operating systems too:
strategy:
matrix:
os: [ubuntu-latest, macos-latest]
python-version: ['3.11', '3.12']
exclude:
- os: macos-latest
python-version: '3.11' # skip one combination to save minutes
That's 3 combinations (2x2 minus 1 exclusion), each running on its own VM in parallel.
Trivia: A single matrix definition with 5 OS options, 4 language versions, and 3 dependency variants generates 60 parallel jobs. Before matrix builds, this required hundreds of lines of copy-pasted YAML. GitHub allows up to 256 jobs per workflow run.
Gotcha: `fail-fast: true` is the default, and it's a trap for matrix builds. One failing Python version cancels the other two, hiding failures. You fix the one you saw, re-run, discover another failure, fix it, re-run again — three iterations instead of one. Set `fail-fast: false` for test matrices. Use `fail-fast: true` only when all legs are identical (parallel shards of the same suite).
Part 4: Expression Syntax and Contexts¶
GitHub Actions has its own expression language inside ${{ }}. It's not shell — it's
evaluated by the GitHub runner before the shell ever sees it.
# Contexts — where data comes from
${{ github.sha }} # full commit SHA
${{ github.ref_name }} # branch name (e.g., "main")
${{ github.event.pull_request.number }} # PR number
${{ secrets.MY_SECRET }} # repository or org secret
${{ vars.MY_VARIABLE }} # non-secret config variable
${{ needs.build.outputs.version }} # output from another job
${{ runner.os }} # "Linux", "macOS", or "Windows"
${{ hashFiles('**/requirements.txt') }} # SHA-256 of file contents (for cache keys)
# Conditionals
${{ github.ref == 'refs/heads/main' }} # true on main branch
${{ contains(github.event.head_commit.message, '[skip ci]') }}
# Status check functions
${{ success() }} # default — previous steps all passed
${{ failure() }} # at least one previous step failed
${{ always() }} # run regardless (cleanup steps)
${{ cancelled() }} # workflow was cancelled
Mental Model: Think of `${{ }}` expressions as a template engine, not a programming language. They're evaluated before the shell runs. This means `${{ secrets.TOKEN }}` gets replaced with the literal secret value in the command string. That's why you should pass secrets as environment variables (`env: TOKEN: ${{ secrets.TOKEN }}`) instead of inlining them in `run:` commands — it avoids shell expansion surprises and keeps secrets out of process listings.
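A before/after sketch of that advice, using a hypothetical `API_TOKEN` secret:

```yaml
# Risky: the secret is template-expanded into the command string itself
- run: curl -H "Authorization: Bearer ${{ secrets.API_TOKEN }}" https://api.example.com

# Better: the runner injects it as an environment variable; the shell reads it at runtime
- run: curl -H "Authorization: Bearer ${API_TOKEN}" https://api.example.com
  env:
    API_TOKEN: ${{ secrets.API_TOKEN }}
```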
Part 5: Build and Push a Docker Image¶
Now we need to package the app. Here's the Docker build-and-push job:
build-image:
runs-on: ubuntu-latest
needs: test
if: github.ref == 'refs/heads/main' # only build images on main
permissions:
contents: read
packages: write # needed to push to GHCR
outputs:
image-tag: ${{ steps.meta.outputs.version }}
steps:
- uses: actions/checkout@v4
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }} # auto-provided, no setup needed
- uses: docker/metadata-action@v5
id: meta
with:
images: ghcr.io/${{ github.repository }}
tags: |
type=sha,prefix= # abc1234
type=ref,event=branch # main
type=semver,pattern={{version}} # 1.2.3 (from git tags)
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha # GitHub Actions cache for Docker layers
cache-to: type=gha,mode=max
Let's unpack the interesting bits:
secrets.GITHUB_TOKEN — this is automatically generated for every workflow run. No
setup required. It's scoped to the repository and expires when the job ends. The
packages: write permission lets it push to GitHub Container Registry (GHCR).
docker/metadata-action — generates smart tags. A push to main gets tagged with
the short SHA and main. A release tag like v1.2.3 also gets 1.2.3. No more
hand-crafting tag logic.
cache-from: type=gha — uses GitHub Actions cache backend for Docker layer caching.
This means your second build only rebuilds the layers that changed. A full build that takes
4 minutes might take 30 seconds on the next run if only application code changed and
dependencies are cached.
Under the Hood: Docker layer caching with `type=gha` stores layer blobs in the same cache infrastructure as `actions/cache`. The `mode=max` option caches all layers (not just the final image layers), which means even intermediate build stages get cached. This is especially valuable for multi-stage Dockerfiles where the dependency-install stage rarely changes.
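As an illustration, a hypothetical multi-stage Dockerfile shaped like this benefits the most, because the expensive dependency layer is served from cache until requirements.txt changes:

```dockerfile
# Stage 1: dependencies. This layer only rebuilds when requirements.txt changes.
FROM python:3.12-slim AS deps
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Stage 2: application. Code changes invalidate only these cheap layers.
FROM python:3.12-slim
WORKDIR /app
# Copy installed packages from the cached stage (console scripts in
# /usr/local/bin would need a similar COPY if your app uses them)
COPY --from=deps /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY app/ ./app/
CMD ["python", "-m", "app"]
```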
Part 6: The OIDC Trick — No Static Credentials, Ever¶
This is the single most important security improvement you can make to any CI/CD pipeline. Instead of storing AWS access keys (or GCP service account keys) as repository secrets, you use OpenID Connect (OIDC) to let GitHub Actions assume a cloud IAM role directly.
No keys to rotate. No keys to leak. No keys at all.
deploy-staging:
runs-on: ubuntu-latest
needs: build-image
environment: staging # gates + environment-specific secrets
permissions:
id-token: write # THIS is what enables OIDC
contents: read
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials via OIDC
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-staging
aws-region: us-east-1
# That's it. No access key. No secret key. Just the role ARN.
- name: Deploy to ECS
run: |
aws ecs update-service \
--cluster api-staging \
--service api \
--force-new-deployment
How OIDC works (the 30-second version)¶
1. Your workflow requests a JWT from GitHub's OIDC provider
2. The JWT contains claims: repo name, branch, workflow, actor
3. AWS (or GCP/Azure) validates the JWT against GitHub's public keys
4. If the claims match the role's trust policy, AWS issues temp credentials
5. Credentials expire when the job ends — usually 1 hour
On the AWS side, you create an IAM role with a trust policy like this:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
},
"StringLike": {
"token.actions.githubusercontent.com:sub": "repo:your-org/your-repo:ref:refs/heads/main"
}
}
}]
}
The sub condition is your security boundary. It restricts which repo, branch, and
environment can assume the role. A PR from a fork can't assume your production role because
its sub claim won't match.
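The sub claim follows a documented pattern. A few common shapes you can match in the trust policy (org, repo, and environment names are placeholders):

```
repo:your-org/your-repo:ref:refs/heads/main      # pushes to main
repo:your-org/your-repo:ref:refs/tags/v*         # tag builds (use StringLike)
repo:your-org/your-repo:environment:production   # jobs using environment: production
repo:your-org/your-repo:pull_request             # pull_request-triggered runs
```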
Remember: OIDC = "prove who you are without a shared secret." The same pattern works for GCP (`google-github-actions/auth`), Azure (`azure/login`), and HashiCorp Vault. If your pipeline still has `AWS_ACCESS_KEY_ID` in repository secrets, replacing it with OIDC is the highest-leverage security improvement you can make today.

Interview Bridge: "How do you authenticate CI/CD pipelines to cloud providers without static credentials?" OIDC federation is the answer. It comes up in almost every DevOps interview that touches cloud security.
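For comparison, a minimal sketch of the GCP equivalent, assuming a Workload Identity pool is already configured (all resource names below are placeholders):

```yaml
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/my-provider
    service_account: deployer@my-project.iam.gserviceaccount.com
```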
Flashcard check¶
| Question | Answer |
|---|---|
| What permission does a workflow need to use OIDC? | id-token: write |
| What does OIDC replace in CI/CD? | Static cloud credentials (access keys, service account keys) |
| Where is the security boundary for OIDC defined? | In the cloud IAM role's trust policy (sub condition) |
| What happens to OIDC-issued credentials when the job ends? | They expire (typically within 1 hour) |
Part 7: Reusable Workflows vs. Composite Actions¶
As your org grows, you'll want to share pipeline logic across repos. GitHub gives you two mechanisms, and they solve different problems.
Reusable workflows — share entire jobs¶
# .github/workflows/reusable-deploy.yml (in a central repo)
name: Deploy
on:
workflow_call:
inputs:
environment:
required: true
type: string
image-tag:
required: true
type: string
secrets:
DEPLOY_TOKEN:
required: true
jobs:
deploy:
runs-on: ubuntu-latest
environment: ${{ inputs.environment }}
steps:
- uses: actions/checkout@v4
- name: Deploy to ${{ inputs.environment }}
env:
DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
run: |
helm upgrade api ./chart \
--set image.tag=${{ inputs.image-tag }} \
-f values-${{ inputs.environment }}.yaml
# Caller workflow (in any repo)
jobs:
deploy-staging:
uses: your-org/shared-workflows/.github/workflows/reusable-deploy.yml@main
with:
environment: staging
image-tag: ${{ needs.build.outputs.image-tag }}
secrets:
DEPLOY_TOKEN: ${{ secrets.STAGING_TOKEN }}
Composite actions — share reusable steps¶
# .github/actions/setup-python-app/action.yml (in your repo or a shared repo)
name: 'Setup Python App'
description: 'Install Python, dependencies, and lint tools'
inputs:
python-version:
required: false
default: '3.12'
outputs:
coverage:
description: 'Test coverage percentage'
value: ${{ steps.test.outputs.coverage }}
runs:
using: 'composite'
steps:
- uses: actions/setup-python@v5
with:
python-version: ${{ inputs.python-version }}
cache: 'pip'
- run: pip install -r requirements.txt -r requirements-test.txt
shell: bash
- id: test
run: |
COV=$(pytest --cov=app --cov-report=term | grep TOTAL | awk '{print $4}')
echo "coverage=$COV" >> "$GITHUB_OUTPUT"
shell: bash
When to use which¶
| | Reusable Workflow | Composite Action |
|---|---|---|
| Scope | Entire job (with its own `runs-on`) | Steps within a job |
| Runner | Runs on its own runner | Runs on the caller's runner |
| Secrets | Can receive secrets via `secrets:` | Inherits caller's environment |
| Nesting | Can call other reusable workflows (max 4 deep) | Can use other actions |
| Use when | Standardizing deployment patterns across repos | Packaging repeated step sequences |
| Think of it as | A function that brings its own VM | A function that runs in your VM |
Mental Model: Reusable workflows are like microservices — self-contained, with their own environment. Composite actions are like library functions — they run inside your process. Use reusable workflows when you need isolation (deployments, security scans). Use composite actions when you need speed (no extra VM boot time).
Part 8: The Fork That Leaked Secrets¶
Time for a war story that will make you audit your workflows tonight.
War Story: In 2021, security researchers demonstrated an attack pattern they called "pwn requests." The target: workflows using `pull_request_target` that checked out the PR's code. Here's how it works.

A public repo has a workflow triggered on `pull_request_target`. Unlike the normal `pull_request` event, `pull_request_target` runs with the base branch's workflow file but gives the job write permissions and access to secrets. The intention is for safe operations like labeling PRs.

But the workflow also runs `actions/checkout` with `ref: ${{ github.event.pull_request.head.sha }}`, pulling in the fork's code. The attacker submits a PR that modifies a build script or `package.json` post-install hook to exfiltrate secrets. The workflow runs the attacker's code with full secrets access.

This pattern compromised repositories belonging to Apache, Eclipse, and multiple CNCF projects. GitHub's security blog documented it explicitly. The fix: never check out untrusted code in a `pull_request_target` workflow. Use `pull_request_target` only for metadata operations (labeling, commenting). Run builds with the `pull_request` event, which doesn't get secrets from forks. This is documented in GitHub's security hardening guide for Actions.
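For the audit checklist in your head, this is roughly what the vulnerable shape looks like; a reconstruction to recognize, not to copy:

```yaml
# VULNERABLE: pull_request_target grants secrets, then checks out attacker code
on: pull_request_target

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}  # attacker-controlled code
      - run: npm install && npm test  # post-install hooks run with secrets in scope
```

The safe variant builds on `pull_request` and keeps `pull_request_target` only for label/comment jobs that never check out PR code.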
This is why the permissions key exists. Declare the minimum your workflow needs:
# Bad — gives write access to everything
permissions: write-all
# Good — explicit least privilege
permissions:
contents: read
packages: write # only if pushing to GHCR
id-token: write # only if using OIDC
Gotcha: Before February 2023, the default `GITHUB_TOKEN` permissions were `write-all`. Repos created before that date still have the old default unless an admin changed it. Check your repo: Settings → Actions → General → Workflow permissions. New repos default to read-only, but inherited org settings may override this.
Part 9: Security Hardening Checklist¶
Here's the checklist. Print it. Tape it to your monitor. Review it before every workflow change.
## GitHub Actions Security Hardening
### Credentials
- [ ] Use OIDC for cloud auth — no static AWS/GCP/Azure keys in secrets
- [ ] Set minimum GITHUB_TOKEN permissions per job (not workflow-level write-all)
- [ ] Use environment protection rules for production deploys
- [ ] Rotate any remaining static secrets on a schedule
### Supply Chain
- [ ] Pin third-party actions to full SHA (not mutable tags)
uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
- [ ] Enable Dependabot for GitHub Actions version updates (see the dependabot.yml sketch after this checklist)
- [ ] Use actions only from trusted publishers or your own org
- [ ] Audit new actions before adding (read their source code)
### Fork Safety
- [ ] Never use pull_request_target + checkout of PR code
- [ ] Require approval for first-time contributors' workflow runs
- [ ] Don't use self-hosted runners on public repos
### Secrets Hygiene
- [ ] Never interpolate secrets in `run:` commands — use `env:` mapping
- [ ] Never echo or log secret values (even masked ones can leak via encoding)
- [ ] If a secret leaks, rotate immediately — logs may be cached externally
### Branch Protection
- [ ] Require status checks to pass before merge
- [ ] Require PR review (prevents direct pushes that skip CI)
- [ ] Enable branch protection for the default branch
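To act on the Dependabot item above, a minimal `.github/dependabot.yml` is enough (the weekly interval is a common choice, not a requirement):

```yaml
# .github/dependabot.yml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```

Dependabot understands SHA pins: it bumps the hash and keeps the trailing version comment in sync.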
War Story: In March 2025, the popular `tj-actions/changed-files` action was compromised via a supply chain attack. The attacker retroactively pointed nearly all of its version tags, including mutable references like `@v35`, at a malicious commit that dumped CI secrets from runner memory into build logs. Organizations referencing the mutable tags ran the compromised version. Organizations that pinned to a specific SHA were completely unaffected. This incident, along with the 2021 Codecov Bash Uploader compromise, is why security teams now mandate SHA pinning for all third-party actions.
Part 10: Debugging Failed Workflows¶
Your workflow is failing. The logs are 2,000 lines of noise. Here's the systematic approach.
1. Start with gh CLI — faster than the web UI¶
# List recent failures
gh run list --status failure --limit 10
# View the failed run
gh run view 12345678
# View only the failed job's logs (skip the successful ones)
gh run view 12345678 --log-failed
# Rerun only the failed jobs (not the whole workflow)
gh run rerun 12345678 --failed
# Watch a running workflow in real-time
gh run watch 12345678
2. Enable debug logging¶
Add these repository secrets (not variables — secrets):
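- ACTIONS_RUNNER_DEBUG set to true
- ACTIONS_STEP_DEBUG set to true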
This dumps verbose runner and step output. Remove them when done — the logs get enormous.
3. Test locally with act¶
# Install act (runs workflows locally using Docker)
brew install act # macOS
# or: curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash
# List available jobs
act -l
# Run a specific job
act push -j lint
# With a secrets file (.secrets, one per line: KEY=value)
act push -j test --secret-file .secrets
# Use a fuller runner image (default is minimal)
act push --platform ubuntu-latest=ghcr.io/catthehacker/ubuntu:act-latest
Gotcha: `act` doesn't perfectly replicate GitHub Actions. Services, OIDC, caching, and some `GITHUB_*` context variables don't work. It's excellent for testing shell commands and basic logic, but don't rely on it for caching or auth workflows. Think of it as a fast feedback loop for the 80% case.
4. The SSH debug escape hatch¶
When logs aren't enough, you can SSH into a running runner:
- name: Debug via SSH
if: failure()
uses: mxschmitt/action-tmate@v3
with:
limit-access-to-actor: true # only the person who triggered the run can connect
This pauses the workflow and gives you a tmate SSH URL. You get a shell on the actual runner with the full environment. Poke around, check environment variables, test commands interactively. Don't leave it in production workflows — it blocks the runner until the timeout.
Part 11: Common Workflow Patterns¶
CI — run on every PR¶
on:
pull_request:
branches: [main]
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true # cancel stale runs when new commits push
CD — deploy on merge to main¶
on:
push:
branches: [main]
concurrency:
group: deploy-production
cancel-in-progress: false # NEVER cancel in-progress deploys
Gotcha: `cancel-in-progress: true` on deployment workflows can leave your infrastructure in a half-migrated state. A database migration gets cancelled mid-flight, your schema is half-applied, and the app crashes. Use `cancel-in-progress: true` for CI (tests are safe to cancel). Use `cancel-in-progress: false` for anything that mutates state.
Scheduled — weekly dependency updates, stale cache warming¶
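A minimal sketch of the trigger block (the cron expression and job body are illustrative):

```yaml
on:
  schedule:
    - cron: '0 6 * * 1'   # Monday 6am UTC (cron times are always UTC)
  workflow_dispatch:       # keep a manual trigger for testing scheduled jobs

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - run: echo "update dependencies / warm caches here"
```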
Release — triggered by publishing a GitHub Release¶
on:
release:
types: [published]
jobs:
publish:
steps:
- run: echo "Publishing ${{ github.event.release.tag_name }}"
Part 12: Cost Management¶
GitHub Actions bills by the minute for private repos. Public repos get unlimited free minutes.
| Runner type | Cost per minute (private repos) | vCPUs | RAM |
|---|---|---|---|
| `ubuntu-latest` | $0.008 | 2 | 7 GB |
| `ubuntu-latest` (4-core) | $0.016 | 4 | 16 GB |
| `macos-latest` | $0.08 | 3 | 14 GB |
| `windows-latest` | $0.016 | 2 | 7 GB |
macOS runners cost 10x Linux runners. Windows costs 2x. If your matrix tests macOS + Windows + Linux across 4 language versions, your monthly bill might surprise you.
Cost reduction tactics:
- Path filters — skip CI for doc-only changes
- `cancel-in-progress: true` on CI — don't test stale pushes
- Caching — pip, npm, Docker layers, apt packages
- Matrix exclusions — do you really need macOS CI on every PR, or just main?
- Timeout limits — a stuck job that runs to the default 6-hour timeout eats 360 minutes × $0.008 = $2.88. Not much once, but a flaky workflow that hangs 10 times a week is about $29/week.
- Self-hosted runners — if you're spending >$500/month on Actions, self-hosted runners on your own infrastructure (or spot instances) often pay for themselves
Trivia: GitHub Actions provides unlimited free minutes for public repositories. This single decision in 2019 effectively killed Travis CI's business model — Travis had been the dominant CI platform for open source, offering free tiers that organizations applied for individually. GitHub made it automatic. Travis CI, acquired by Idera in early 2019, laid off most of its engineering team soon after.
Part 13: The Complete Pipeline¶
Here's everything assembled — a production-grade workflow for the Python API we started with. Every annotation refers back to a section of this lesson.
name: CI/CD Pipeline
on:
push:
branches: [main]
paths:
- 'app/**'
- 'tests/**'
- 'requirements*.txt'
- 'Dockerfile'
- '.github/workflows/ci.yml'
pull_request:
branches: [main]
permissions:
contents: read # Part 8: least privilege default
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
# --- CI Gate ---
lint:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
cache: 'pip'
- run: pip install ruff
- run: ruff check app/ tests/
test:
runs-on: ubuntu-latest
timeout-minutes: 10
needs: lint # Part 2: dependency chain
strategy:
fail-fast: false # Part 3: see all failures at once
matrix:
python-version: ['3.10', '3.11', '3.12']
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'
- run: pip install -r requirements.txt -r requirements-test.txt
- run: pytest --cov=app --cov-fail-under=90 -v
# --- Docker Build ---
build-image:
runs-on: ubuntu-latest
needs: test # all matrix legs must pass
if: github.ref == 'refs/heads/main'
permissions:
contents: read
packages: write # Part 5: push to GHCR
outputs:
image-tag: ${{ steps.meta.outputs.version }}
steps:
- uses: actions/checkout@v4
- uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/metadata-action@v5
id: meta
with:
images: ghcr.io/${{ github.repository }}
tags: |
type=sha,prefix=
type=ref,event=branch
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
# --- Deploy to Staging ---
deploy-staging:
runs-on: ubuntu-latest
needs: build-image
if: github.ref == 'refs/heads/main'
environment: staging # Part 6: environment protection
permissions:
id-token: write # Part 6: OIDC
contents: read
concurrency:
group: deploy-staging
cancel-in-progress: false
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-staging
aws-region: us-east-1
- name: Deploy to staging
run: |
aws ecs update-service \
--cluster api-staging \
--service api \
--force-new-deployment
# --- Deploy to Production ---
deploy-production:
runs-on: ubuntu-latest
needs: deploy-staging
if: github.ref == 'refs/heads/main'
environment: production # requires manual approval in GitHub settings
permissions:
id-token: write
contents: read
concurrency:
group: deploy-production
cancel-in-progress: false # Part 11: never cancel deploys
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-actions-production
aws-region: us-east-1
- name: Deploy to production
run: |
aws ecs update-service \
--cluster api-production \
--service api \
--force-new-deployment
# --- Notify on Failure ---
notify:
runs-on: ubuntu-latest
needs: [lint, test, build-image, deploy-staging, deploy-production]
if: failure() # Part 4: status check function
steps:
- name: Send failure notification
run: |
curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
-H 'Content-Type: application/json' \
-d '{"text": "Pipeline failed for ${{ github.repository }}@${{ github.sha }}"}'
The flow¶
push to main
|
v
[lint] ──fail──> stop
|
v (pass)
[test: 3.10] ─┐
[test: 3.11] ─┼─ any fail ──> stop
[test: 3.12] ─┘
|
v (all pass)
[build-image] ──> push to GHCR
|
v
[deploy-staging] ──> OIDC auth ──> ECS update
|
v
[deploy-production] ──> manual approval gate ──> OIDC auth ──> ECS update
|
v
done (or [notify] on any failure)
Exercises¶
Exercise 1: Your first workflow (2 minutes)¶
Create .github/workflows/hello.yml in any repository:
name: Hello
on: [push]
jobs:
greet:
runs-on: ubuntu-latest
steps:
- run: echo "SHA is ${{ github.sha }}"
Push it. Watch it run in the Actions tab.
What to look for

- The workflow appears in the Actions tab within seconds
- The log shows the full SHA, not the literal string `${{ github.sha }}`
- The runner is a fresh Ubuntu VM — check the "Set up job" step for the image version

Exercise 2: Add a matrix (10 minutes)¶
Take the hello workflow and make it run on three different Ubuntu versions:
ubuntu-22.04, ubuntu-24.04, and ubuntu-latest.
Add a step that prints cat /etc/os-release so you can see the actual OS version for each.
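If you get stuck, one possible solution sketch:

```yaml
jobs:
  greet:
    strategy:
      matrix:
        os: [ubuntu-22.04, ubuntu-24.04, ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - run: echo "SHA is ${{ github.sha }}"
      - run: cat /etc/os-release
```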
Exercise 3: Build a secure pipeline (30 minutes)¶
Take the complete pipeline from Part 13 and adapt it for a Node.js application. You'll need to:
- Replace `setup-python` with `setup-node`
- Replace `pytest` with your test runner (jest, vitest, etc.)
- Replace `ruff` with `eslint` or `biome`
- Keep OIDC, matrix builds, and SHA-pinned actions
Bonus: add a step that runs npm audit and fails the build if high-severity
vulnerabilities are found.
Key differences

- `setup-node` uses `cache: 'npm'` (or `cache: 'pnpm'`)
- Node.js has more OS-dependent behavior than Python — matrix testing across OS matters more
- `npm audit --audit-level=high` exits non-zero on high/critical vulns
- `npm ci` is the CI equivalent of `npm install` — deterministic, uses lockfile only

Cheat Sheet¶
| What | How |
|---|---|
| Trigger on push | on: push: branches: [main] |
| Skip CI for docs | on: push: paths: ['src/**'] |
| Manual trigger | on: workflow_dispatch: |
| Job dependency | needs: [lint, test] |
| Matrix build | strategy: matrix: python: ['3.10', '3.12'] |
| See all failures | strategy: fail-fast: false |
| Cache deps | setup-python: with: cache: 'pip' or actions/cache@v4 |
| Pass data between jobs | outputs: + ${{ needs.job.outputs.key }} |
| OIDC auth | permissions: id-token: write + aws-actions/configure-aws-credentials |
| Pin action to SHA | uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 |
| Cancel stale CI | concurrency: group: ci-${{ github.ref }} + cancel-in-progress: true |
| Queue deploys | concurrency: group: deploy-prod + cancel-in-progress: false |
| Run on failure only | if: failure() |
| Run always | if: always() |
| Environment gate | environment: production (set approval rules in repo settings) |
| Debug a run | gh run view <id> --log-failed |
| Rerun failures | gh run rerun <id> --failed |
| Test locally | act push -j build |
Takeaways¶
- OIDC over static keys, always. The single biggest security win for any CI/CD pipeline. No credentials to leak, no keys to rotate.
- Pin third-party actions to SHAs. Tags are mutable pointers. A compromised tag hijacks every workflow that references it. SHAs are immutable.
- Set `fail-fast: false` on test matrices. See all failures in one run, not three sequential "fix and retry" cycles.
- Declare `permissions` explicitly. The default might be `write-all` if your repo predates February 2023. Least privilege isn't paranoia — it's the difference between a nuisance PR and a full repo compromise.
- CI/CD lives in your repo, which means it's code. Review workflow changes like you review application code. A one-line YAML change can expose every secret in your org.
- The `environment` feature is your production gate. Required reviewers, wait timers, and branch restrictions on environments prevent accidental (or malicious) deploys.
Related Lessons¶
- What Happens When You `git push` to CI — traces the journey from `git push` to running CI, covering Git internals and webhooks
- GitOps: The Repo Is the Truth — what happens after CI when your deployment model is pull-based
- Supply Chain Security: Trusting Your Dependencies — the broader supply chain context for action pinning and artifact verification
- Secrets Management Without Tears — OIDC in the context of a full secrets management strategy
- What Happens When You `docker build` — Docker layer caching explained from the ground up