Packer: Building Machine Images That Don't Lie
- lesson
- packer
- machine-images
- immutable-infrastructure
- ansible
- docker
- ci/cd
- image-testing
- golden-images
- cloud-init
Topics: Packer, machine images, immutable infrastructure, Ansible, Docker, CI/CD, image testing, golden images, cloud-init
Strategy: Build-up + parallel
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)
The Mission¶
Your team runs 40 EC2 instances behind an autoscaling group. Every instance launches from
the same AMI — at least, that's what everyone believes. But last Thursday, a new instance
joined the group and immediately started throwing 500s. The app binary was there. Nginx was
there. But the monitoring agent was missing, and /etc/app/config.yaml had settings from
two releases ago.
How? Someone SSHed into a running instance three months ago, "fixed" the config, and never updated the image. The autoscaler launched a fresh instance from the original AMI — the one without the fix. The instance was "correct" by the image's definition and wrong by reality's definition.
This is image drift, and it will ruin your weekend. This lesson builds the cure: a Packer-driven image pipeline where every server boots from a known, tested, version-controlled image. No hand-patching. No snowflakes. No surprises at 3 AM.
Why Machine Images? (The Debate You Need to Understand)¶
There are two schools of thought about how to get software onto a server, and most teams use both without realizing they have chosen a philosophy.
School 1: Configure at Boot (Config Management)¶
Launch a bare OS image. On first boot, Ansible/Chef/Puppet runs and installs everything. Every instance converges to the desired state.
School 2: Bake Everything In (Golden Images)¶
Install everything into the image at build time. Launch it. It is ready instantly.
| Factor | Config at Boot | Golden Image |
|---|---|---|
| Launch speed | 5–15 minutes | 30 seconds |
| Drift risk | High (convergence can fail) | Low (image is immutable) |
| Debugging | Check Ansible logs on every instance | Check one build log |
| Rollback | Re-run old playbook (hope it works) | Launch old AMI |
| Cost | Compute time on every boot | Build time once |
Mental Model: Think of it like restaurants. Config management is cooking each meal to order — flexible, but slow and error-prone at scale. Golden images are meal-prepping on Sunday — fast to serve, consistent every time, but you have to rebuild the whole batch when the recipe changes.
The real world uses both: bake the base, configure the specifics. Packer builds an image with the OS, packages, agents, and hardening. Cloud-init injects secrets, hostnames, and endpoints at boot.
Remember: The bake-vs-boot mnemonic: BOSS — Binaries bake, OS config bake, Secrets boot, Settings (env-specific) boot.
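To make the boot half of BOSS concrete, here is a minimal sketch of a cloud-init user-data file injected at launch (e.g. via Terraform's `user_data`). The hostname, endpoint, and secret-fetch command are illustrative placeholders, not part of this lesson's templates — everything else is assumed to be baked into the image already.

```yaml
#cloud-config
# Hypothetical per-instance user-data: only the S's of BOSS live here.
hostname: app-prod-01
write_files:
  - path: /etc/app/runtime.env
    content: |
      API_ENDPOINT=https://api.prod.example.internal
runcmd:
  # Fetch secrets from a secret manager at boot — never bake them into the image
  - /usr/local/bin/fetch-secrets --out /etc/app/secrets.env
```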
What Is Packer?¶
Packer builds identical machine images for multiple platforms from a single source configuration. You define what the image should contain once. Packer produces an AMI, a Docker image, a Vagrant box — whatever you need — from that definition. It is not a configuration management tool. It does not run on live servers. It runs once, produces an artifact, and exits.
Name Origin: Packer was created by Mitchell Hashimoto (HashiCorp co-founder) and released in July 2013. It was one of HashiCorp's earliest tools, predating Terraform (2014). Before Packer, teams built images by booting a VM, manually installing software, and snapshotting — a non-reproducible process called "golden image by hand."
The Four Building Blocks¶
| Concept | Role | Analogy |
|---|---|---|
| Template | HCL2 file(s) defining the entire build | The recipe |
| Builder | Plugin that creates the image for a platform (AWS, Docker, QEMU) | The kitchen |
| Provisioner | Runs inside the build to install/configure (shell, Ansible, file) | The chef |
| Post-processor | Acts on the finished artifact (manifest, push, compress) | The packaging line |
Builders create the blank canvas. Provisioners paint on it. Post-processors ship it.
Your First Packer Template¶
Before theory gets heavy, let's look at a real template. This builds an Ubuntu AMI on AWS with Nginx installed.
# ubuntu-nginx.pkr.hcl
packer {
  required_plugins {
    amazon = {
      version = ">= 1.3.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

source "amazon-ebs" "ubuntu" {
  ami_name      = "ubuntu-nginx-{{timestamp}}"
  instance_type = "t3.micro"
  region        = var.aws_region

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["099720109477"] # Canonical's AWS account ID
    most_recent = true
  }

  ssh_username = "ubuntu"

  tags = {
    Name       = "ubuntu-nginx"
    Built-By   = "packer"
    Git-SHA    = "{{env `GIT_SHA`}}"
    Build-Time = "{{timestamp}}"
  }
}

build {
  sources = ["source.amazon-ebs.ubuntu"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx=1.18.0-6ubuntu14.4",
      "sudo systemctl enable nginx"
    ]
  }

  post-processor "manifest" {
    output     = "manifest.json"
    strip_path = true
  }
}
Let's break down what each piece does:
| Block | What it does |
|---|---|
| `packer { required_plugins }` | Declares plugin dependencies — `packer init` downloads them |
| `variable "aws_region"` | Parameterizes the region so you can override it per environment |
| `source "amazon-ebs" "ubuntu"` | Configures the builder: launch a t3.micro, find the latest Ubuntu 22.04 AMI, SSH in as `ubuntu` |
| `source_ami_filter` | Finds the latest official Canonical AMI dynamically instead of hardcoding an AMI ID |
| `build { sources }` | References which source(s) to build |
| `provisioner "shell"` | Runs commands inside the build instance — installs and enables Nginx |
| `post-processor "manifest"` | Writes the resulting AMI ID to `manifest.json` for downstream tools |
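The manifest is the hook into the rest of the pipeline: downstream tooling reads `manifest.json` to find the new AMI. As a sketch — the file below is a hand-written stand-in for real Packer output, whose `artifact_id` field has the form `region:ami-id` — extracting the ID looks like:

```shell
# Stand-in for a manifest.json written by the manifest post-processor
cat > manifest.json <<'EOF'
{"builds":[{"name":"ubuntu","builder_type":"amazon-ebs","artifact_id":"us-east-1:ami-0123456789abcdef0"}]}
EOF

# artifact_id is "region:ami-id" — strip the region prefix to get the AMI ID
AMI_ID=$(sed -n 's/.*"artifact_id":"[^:"]*:\([^"]*\)".*/\1/p' manifest.json)
echo "$AMI_ID"
```

In CI the same one-liner (or `jq -r '.builds[-1].artifact_id'`) feeds the AMI ID to the deploy step.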
Run It¶
# Download plugins
packer init .
# Check for syntax errors (2 seconds vs finding out 10 minutes into a build)
packer validate .
# Build the image
packer build .
Gotcha:
`packer validate` catches syntax errors. It does NOT catch runtime errors like "this AMI doesn't exist" or "your AWS credentials are expired." Those fail during `packer build`. Always validate, but don't treat a passing validate as a guarantee.
Flashcard Check #1¶
Cover the answers. Test yourself.
| Question | Answer |
|---|---|
| What are Packer's four building blocks? | Template, Builder, Provisioner, Post-processor |
| What's the difference between a builder and a provisioner? | Builder creates the blank image for a platform; provisioner installs software inside it |
| Why use `source_ami_filter` instead of hardcoding an AMI ID? | The latest AMI ID changes with every Canonical release; the filter always finds the newest one |
| What does `packer init` do? | Downloads plugins defined in `required_plugins` blocks |
| What template format is recommended: JSON or HCL2? | HCL2 — JSON templates are legacy (pre-2020) and lack variables, locals, and functions |
The Full Pipeline: Packer + Ansible + Testing¶
A shell provisioner works for simple cases. For real infrastructure, you want Ansible running your hardening playbook, your monitoring agent role, and your application setup — the same playbook you'd use anywhere, just executed inside a Packer build.
# golden-ami.pkr.hcl — production template with Ansible + cleanup
packer {
  required_plugins {
    amazon  = { version = ">= 1.3.0", source = "github.com/hashicorp/amazon" }
    ansible = { version = ">= 1.1.0", source = "github.com/hashicorp/ansible" }
  }
}

variable "app_version" { type = string }

variable "env" {
  type    = string
  default = "dev"
}

locals {
  ami_name = "app-${var.app_version}-${formatdate("YYYYMMDD-hhmm", timestamp())}"
}

source "amazon-ebs" "app" {
  ami_name      = local.ami_name
  instance_type = "t3.medium"
  region        = "us-east-1"

  source_ami_filter {
    filters     = { name = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" }
    owners      = ["099720109477"]
    most_recent = true
  }

  ssh_username = "ubuntu"
  ami_regions  = ["us-east-1", "us-west-2", "eu-west-1"]

  tags = {
    Name        = local.ami_name
    App-Version = var.app_version
    Git-SHA     = "{{env `GIT_SHA`}}"
    Built-By    = "packer"
    Base-OS     = "ubuntu-22.04"
  }
}

build {
  sources = ["source.amazon-ebs.app"]

  provisioner "file" {
    source      = "files/cloud-init-defaults.yaml"
    destination = "/tmp/cloud-init-defaults.yaml"
  }

  provisioner "ansible" {
    playbook_file    = "ansible/site.yml"
    extra_arguments  = ["--extra-vars", "app_version=${var.app_version} env=${var.env}", "-v"]
    ansible_env_vars = ["ANSIBLE_HOST_KEY_CHECKING=False"]
  }

  # Security cleanup — never skip this
  provisioner "shell" {
    inline = [
      "sudo rm -f /home/*/.ssh/authorized_keys",
      "sudo rm -f /root/.ssh/authorized_keys",
      "sudo truncate -s 0 /etc/machine-id",
      "sudo rm -rf /tmp/* /var/tmp/*",
      "history -c"
    ]
  }

  post-processor "manifest" {
    output     = "manifest.json"
    strip_path = true
  }
}
Under the Hood: When you use the `ansible` provisioner, Packer generates a temporary SSH keypair, boots the build instance, and passes the SSH connection details to Ansible. Ansible runs against the build instance exactly like any other target — same playbook, same roles, same variables. The only difference: this instance will be snapshotted into an image and destroyed.
Why the Cleanup Step Matters¶
That cleanup provisioner is not optional. Without it: Packer's temporary SSH key stays in
authorized_keys (anyone with it can SSH into every launched instance), /etc/machine-id
is identical across all instances (breaks DHCP and journal logging), and shell history
leaks build commands.
Gotcha: Packer creates a temporary security group with port 22 open to 0.0.0.0/0. If the build fails and cleanup doesn't run, that security group lingers. Always use `-on-error=cleanup` (the default) in CI.
War Story: The AMI That Worked in Dev¶
War Story: A fintech team baked database credentials into their golden AMI. Months later, they rotated the password. Every new instance launched from the old AMI silently connected with stale credentials — intermittent auth failures that took three days to trace. Fix: 10 minutes. Finding the cause: 72 hours and a sev-2 incident. The rule: images contain software and configuration, never credentials.
Violating the bake-vs-boot boundary doesn't fail immediately. It creates a time bomb that detonates on credential rotation, key expiration, or certificate renewal.
Parallel Builds: One Template, Multiple Platforms¶
Here is where Packer's design really shines. You have one team that runs on AWS, another that uses Docker for local development, and a third that uses Vagrant for testing. Three platforms, one source of truth.
# multi-platform.pkr.hcl — three sources, one build block
variable "build_tag" {
  type    = string
  default = "dev"
}

source "amazon-ebs" "app" {
  ami_name      = "app-{{timestamp}}"
  instance_type = "t3.micro"
  region        = "us-east-1"

  source_ami_filter {
    filters     = { name = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" }
    owners      = ["099720109477"]
    most_recent = true
  }

  ssh_username = "ubuntu"
}

source "docker" "app" {
  image  = "ubuntu:22.04"
  commit = true
}

source "vagrant" "app" {
  source_path  = "ubuntu/jammy64"
  provider     = "virtualbox"
  communicator = "ssh"
}

build {
  sources = [
    "source.amazon-ebs.app",
    "source.docker.app",
    "source.vagrant.app"
  ]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx=1.18.0-6ubuntu14.4 curl jq",
      "sudo systemctl enable nginx || true" # Docker has no systemd
    ]
  }

  provisioner "ansible" {
    playbook_file = "ansible/app.yml"
  }

  # Docker-specific post-processors use "only" to target one source
  post-processor "docker-tag" {
    repository = "registry.internal/app"
    tags       = ["latest", var.build_tag]
    only       = ["docker.app"]
  }

  post-processor "manifest" {
    output     = "manifest.json"
    strip_path = true
  }
}
Build all three at once:

packer build .

Build just the Docker image:

packer build -only='docker.app' .

Build just the AMI:

packer build -only='amazon-ebs.app' .
Trivia: Packer executes multi-platform builds in parallel by default. If you define three sources, Packer launches three builds simultaneously. This was unusual for infrastructure tools in 2013. The build-once-deploy-many model means the slow build happens once (15–30 minutes for an AMI) and then hundreds of servers launch from the pre-baked image in seconds.
Packer Docker vs Dockerfile¶
Most Docker images should use a Dockerfile — it has layer caching and the ecosystem expects it. Packer's Docker builder exists for one specific case: when you need the same provisioning to produce both a VM image and a container image from one template. If you only need a container, use a Dockerfile.
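For contrast, a minimal sketch of the Dockerfile route for the same Nginx install — what you'd reach for when no VM image is needed (base image as in the templates above):

```dockerfile
# Dockerfile equivalent of the shell provisioner — gets layer caching for free
FROM ubuntu:22.04
RUN apt-get update \
 && apt-get install -y nginx curl jq \
 && rm -rf /var/lib/apt/lists/*
```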
Flashcard Check #2¶
| Question | Answer |
|---|---|
| Why should you never bake secrets into a machine image? | Images are copied, shared, and stored — secrets persist in every copy forever |
| What does `packer build -only='docker.app' .` do? | Builds only the Docker source, skipping other sources in the template |
| What's the cleanup provisioner for? | Removes temporary SSH keys, clears machine-id, deletes history — prevents security issues |
Variables: No More Hardcoded Values¶
You already saw variables in the full template above. Here's the quick reference for how to set them — three ways, in order of precedence (highest wins):
# 1. Command-line flag (highest precedence)
packer build -var 'aws_region=us-west-2' -var 'app_version=2.1.0' .
# 2. Variable file
packer build -var-file=prod.pkrvars.hcl .
# 3. Environment variable (prefix with PKR_VAR_)
export PKR_VAR_aws_region=us-west-2
packer build .
Variable files are HCL: aws_region = "us-east-1", one per line. Use locals {} blocks
for computed values that appear in multiple places (like AMI names with timestamps).
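A minimal sketch of such a variable file (values are illustrative):

```hcl
# prod.pkrvars.hcl — loaded with: packer build -var-file=prod.pkrvars.hcl .
aws_region  = "us-west-2"
app_version = "2.1.0"
env         = "prod"
```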
Gotcha:
`timestamp()` returns a different value every time Packer runs. If you need reproducible names for testing, pass the timestamp as a variable instead.
The Immutable Infrastructure Pattern¶
Packer was built on a philosophy: servers should never be modified after deployment. If you need a change, build a new image and replace the old instances.
Traditional (mutable):
server → patch → patch → patch → drift → mystery config → 3am incident
Immutable:
image v1 → deploy → works
change needed → image v2 → deploy → works
rollback needed → image v1 → deploy → works
Trivia: The "immutable infrastructure" concept was championed by Chad Fowler in a 2013 blog post, the same year Packer was released. Netflix was one of Packer's earliest prominent users, using it to bake AMIs for their entire fleet — an approach that became an industry best practice.
How Immutable Infrastructure Connects to CI/CD¶
The image pipeline looks like this:
code change → CI triggers Packer build → image created
→ automated tests (boot, smoke, compliance)
→ promote to production account
→ Terraform deploys new instances from promoted image
→ old instances drained and terminated
Packer owns the image. Terraform owns the infrastructure. Clean boundary.
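A sketch of that boundary from Terraform's side — a data source that resolves "the newest Packer-built image" via the tags baked in earlier (the `Built-By` tag name matches this lesson's templates; everything else is illustrative):

```hcl
# Terraform: look up the latest AMI that Packer tagged, then launch from it
data "aws_ami" "app" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:Built-By"
    values = ["packer"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.app.id
  instance_type = "t3.medium"
}
```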
Testing Images Before Production¶
Building an image without testing it is like shipping code without running tests. The image boots fine on your laptop's mental model — but does it actually start Nginx? Does the health check pass? Is the right version of Python installed?
Goss — Lightweight Image Testing¶
Goss is a YAML-based server validation tool. Write what you expect, run it, get a pass/fail.
# goss/goss.yaml — declare what "correct" looks like
package:
  nginx: { installed: true, versions: ["1.18.0"] }
service:
  nginx: { enabled: true }
port:
  tcp:80: { listening: true }
file:
  /etc/app/config.yaml: { exists: true }
Run it as the last provisioner — if Goss exits non-zero, the build fails:
provisioner "file" {
  source      = "goss/goss.yaml"
  destination = "/tmp/goss.yaml"
}

provisioner "shell" {
  inline = [
    "curl -fsSL https://goss.rocks/install | GOSS_VER=v0.4.4 sh",
    "goss -g /tmp/goss.yaml validate --retry-timeout 30s"
  ]
}
InSpec — Compliance-Focused Testing¶
For compliance (CIS benchmarks, STIG, PCI-DSS), InSpec runs policy-as-code against the image. The CI pattern: Packer builds, a test job boots an instance, runs InSpec, tears down, and only promotes the AMI if all checks pass.
Packer build → deploy test instance → run Goss/InSpec → terminate
→ pass? → copy AMI to prod account, tag as "approved"
→ fail? → alert team, do not promote
Interview Tip: "How do you ensure your AMIs are secure and up to date?" Strong answer: Packer pipeline builds weekly from the latest base AMI, runs CIS hardening via Ansible, validates with InSpec/Goss tests, promotes through dev → staging → prod accounts. Old AMIs are deregistered after 90 days.
Image Pipeline in CI/CD¶
The CI workflow follows this pattern: push to packer/** triggers a build, or a weekly
cron rebuilds to pick up base image patches.
# .github/workflows/build-ami.yml (key steps)
on:
  push:
    paths: ["packer/**", "ansible/**"]
  schedule:
    - cron: "0 6 * * 1" # Weekly Monday 6am UTC

jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-packer@main
      - run: packer init packer/
      - run: packer fmt -check packer/ && packer validate packer/
      - run: packer build -var "app_version=${{ github.sha }}" packer/
        env:
          GIT_SHA: ${{ github.sha }}
Use OIDC federation (aws-actions/configure-aws-credentials with role-to-assume)
instead of long-lived AWS access keys in CI secrets.
Remember: Tag every AMI with: git SHA, build timestamp, base OS version, and builder identity. Mnemonic: STOB — SHA, Time, OS, Builder. When an instance misbehaves in production, these tags let you trace it back to the exact commit that built the image.
Flashcard Check #3¶
| Question | Answer |
|---|---|
| What is the immutable infrastructure pattern? | Never modify servers after deployment — build a new image and replace |
| Name two image testing tools | Goss (lightweight server validation), InSpec (compliance-focused policy-as-code) |
| What does the STOB mnemonic stand for? | SHA, Time, OS, Builder — the four tags every golden image needs |
Debugging Failed Builds¶
Packer builds fail. The instance is terminated. You're staring at "provisioner shell returned non-zero exit status" and no way to inspect the machine. Here's the toolkit:
| Technique | Command |
|---|---|
| Pause on failure | `packer build -on-error=ask .` |
| Keep instance alive | `packer build -on-error=abort .` |
| Full debug logging | `PACKER_LOG=1 packer build .` |
| Step-by-step | `packer build -debug .` |
When Packer hangs at "Timeout waiting for SSH," the three most common causes are: (1)
firewall blocks port 22, (2) ssh_username doesn't match the AMI default (Ubuntu:
ubuntu, Amazon Linux: ec2-user), (3) no IP address (private subnet, DHCP failure).
Provisioners and Post-Processors: Quick Reference¶
Provisioners run in order. If one fails, the build fails.
| Provisioner | What it does | Key gotcha |
|---|---|---|
| `shell` | Runs commands or scripts — the workhorse | Non-login shell: no `.bashrc`, must use `-y` on apt |
| `file` | Copies files from host into image | Copies as SSH user, not root — use `/tmp/` then `sudo mv` |
| `ansible` | Runs a playbook against the build instance | Packer generates a temporary SSH key, passes connection to Ansible |
| `powershell` | Shell equivalent for Windows builds | Different escaping rules than bash |
Recommended order: file (copy configs in) → shell (install packages) → ansible (configure) → shell (run tests) → shell (cleanup SSH keys, machine-id, history).
Post-processors act on the finished artifact:
| Post-processor | What it does |
|---|---|
| `manifest` | Writes AMI ID, builder name, timestamp to JSON — essential for CI/CD |
| `docker-tag` / `docker-push` | Tags and pushes Docker images to a registry |
| `vagrant` | Packages the artifact as a `.box` file |
| `checksum` | Generates SHA256 checksum of the output |
Use the only parameter to limit post-processors to specific sources (e.g., Docker-push only runs on the Docker build).
Packer vs Everything Else¶
| Tool | Builds | When to use it instead of Packer |
|---|---|---|
| Dockerfile | Container images | Standard app containers — has layer caching, faster rebuilds |
| cloud-init | Runtime config (not images) | Per-instance settings at boot — secrets, endpoints, hostnames |
| AWS Image Builder | AMIs | AWS-only shops wanting a managed service |
Mental Model: Packer is the build tool, Terraform is the deploy tool, and cloud-init is the customize tool. Packer builds the image, Terraform launches instances from it, cloud-init injects the last-mile configuration. Mixing up their responsibilities is how drift starts.
The Footgun Gallery¶
Things that will bite you, ranked by how much time they waste:
1. Not Pinning Provisioner Versions¶
# Three months from now, this installs a different version
apt-get install -y nginx
curl -fsSL https://get.docker.com | sh
Fix: Pin everything. apt-get install -y nginx=1.18.0-6ubuntu14.4. Use checksums on
downloaded binaries. Pin Ansible collection versions in requirements.yml.
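The "checksums on downloaded binaries" part can be sketched like this — the download is faked with a local file so the pattern is runnable anywhere; in a real template the hash is a constant you pin next to the URL:

```shell
# Simulate a downloaded artifact (in a real build: curl -fsSLO the release URL)
printf 'pretend this is a release tarball\n' > app.tar.gz

# Pin the expected hash in the template; here we compute it once as a stand-in
# for the value you would have copied from the vendor's release page
PINNED_SHA="$(sha256sum app.tar.gz | cut -d' ' -f1)"

# The check that belongs in your provisioner: fail the build on any mismatch
echo "${PINNED_SHA}  app.tar.gz" | sha256sum -c - || exit 1
```

If the artifact ever changes out from under you, `sha256sum -c` exits non-zero and the provisioner (and therefore the build) fails — exactly what you want.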
2. Skipping Validate¶
packer build will fail 8 minutes into a 10-minute build because of a syntax error you
could have caught in 2 seconds with packer validate .. Always validate. Make it a CI step.
3. Confusing JSON and HCL2¶
If you find a Packer tutorial using template.json, translate it to HCL2. Use
packer hcl2_upgrade to convert, but review the output — the conversion is not perfect.
4. Leaving Orphaned Cloud Resources¶
Failed builds leave running instances and security groups that accumulate cost silently.
Use -on-error=cleanup in CI and run a weekly sweep for resources tagged packer-builder
older than 24 hours.
Exercises¶
Exercise 1: Read a Template (2 minutes)¶
Look at the first template (ubuntu-nginx.pkr.hcl). Without scrolling back: What instance
type? What Ubuntu version? What package? What post-processor?
Answers
`t3.micro`, Ubuntu Jammy 22.04, `nginx=1.18.0-6ubuntu14.4`, `manifest` (writes AMI ID to JSON)

Exercise 2: Spot the Footguns (5 minutes)¶
This template has at least four problems. Find them all.
source "amazon-ebs" "web" {
  ami_name      = "web-server"
  instance_type = "t3.micro"
  region        = "us-east-1"
  source_ami    = "ami-0abcdef1234567890"
  ssh_username  = "root"
}

build {
  sources = ["source.amazon-ebs.web"]

  provisioner "shell" {
    inline = [
      "apt-get install nginx",
      "echo 'DB_PASSWORD=hunter2' >> /etc/app/.env",
      "curl -fsSL https://get.docker.com | sh"
    ]
  }
}
Answers
1. **Hardcoded AMI ID** — use `source_ami_filter` instead
2. **Static `ami_name`** — second build fails. Add `{{timestamp}}`
3. **`ssh_username = "root"`** — Ubuntu uses `ubuntu`. Root SSH is disabled
4. **No `-y` on `apt-get`** — hangs waiting for confirmation
5. **Secret baked into image** — `DB_PASSWORD=hunter2` persists forever
6. **Unpinned Docker install** — not reproducible
7. **No cleanup provisioner** — SSH keys leak into the image
8. **No post-processor** — no way to track which AMI was produced

Exercise 3: Design a Multi-Platform Template (10 minutes)¶
Your team needs an image with Python 3.11 and Redis client tools, available as an AWS AMI,
a Docker image, and a Vagrant box. Sketch the HCL structure (source blocks, provisioners,
post-processors). Which post-processors need the only parameter and why?
Hint
Three `source` blocks, one `build` block referencing all three. Docker-push and docker-tag need `only = ["docker.app"]`. Remember: Docker has no systemd, so `systemctl enable` needs `|| true`.

Cheat Sheet¶
| Command | What it does |
|---|---|
| `packer init .` | Download plugins |
| `packer fmt -check .` | Check formatting (CI gate) |
| `packer validate .` | Syntax check — always run before build |
| `packer build .` | Execute the build |
| `packer build -only='amazon-ebs.app' .` | Build a single source |
| `packer build -var-file=prod.pkrvars.hcl .` | Load variables from file |
| `packer build -on-error=ask .` | Pause on failure for debugging |
| `PACKER_LOG=1 packer build .` | Full debug logging |
| Mnemonic | Meaning |
|---|---|
| BOSS | Binaries bake, OS config bake, Secrets boot, Settings boot |
| STOB | SHA, Time, OS, Builder — golden image tags |
Takeaways¶
- Image drift is a class of incident, not a single bug. It comes from mutable servers and missing image pipelines. Packer eliminates it by making images the only path to production.
- Bake the base, configure the specifics. Packages, agents, and hardening go in the image. Secrets, endpoints, and per-environment settings go through cloud-init at boot.
- One template, multiple platforms. Packer's parallel build model means your AWS AMI, Docker image, and Vagrant box come from the same source of truth. Drift between environments dies here.
- Test your images before they reach production. Goss or InSpec inside the Packer build catches broken images before they launch. A failing test fails the build.
- Tag everything. SHA, timestamp, OS version, builder identity. When an instance misbehaves at 3 AM, these tags are the difference between a 10-minute fix and a 3-hour investigation.
- Never bake secrets. This is the footgun that keeps firing. Credentials in images create time bombs that detonate on rotation.
Related Lessons¶
- What Happens When You `docker build` — the container image build process from the inside
- GitOps: The Repo Is the Truth — version-controlling your infrastructure definitions
- What Happens When You `git push` to CI — the CI pipeline that triggers your Packer build
- Ansible Playbook Debugging — debugging the Ansible provisioner when it fails inside a Packer build
- The Terraform State Disaster — the other half of the Packer + Terraform pipeline