
Packer: Building Machine Images That Don't Lie


Topics: Packer, machine images, immutable infrastructure, Ansible, Docker, CI/CD, image testing, golden images, cloud-init
Strategy: Build-up + parallel
Level: L1–L2 (Foundations to Operations)
Time: 75–90 minutes
Prerequisites: None (everything is explained from scratch)


The Mission

Your team runs 40 EC2 instances behind an autoscaling group. Every instance launches from the same AMI — at least, that's what everyone believes. But last Thursday, a new instance joined the group and immediately started throwing 500s. The app binary was there. Nginx was there. But the monitoring agent was missing, and /etc/app/config.yaml had settings from two releases ago.

How? Someone SSHed into a running instance three months ago, "fixed" the config, and never updated the image. The autoscaler launched a fresh instance from the original AMI — the one without the fix. The instance was "correct" by the image's definition and wrong by reality's definition.

This is image drift, and it will ruin your weekend. This lesson builds the cure: a Packer-driven image pipeline where every server boots from a known, tested, version-controlled image. No hand-patching. No snowflakes. No surprises at 3 AM.


Why Machine Images? (The Debate You Need to Understand)

There are two schools of thought about how to get software onto a server, and most teams use both without realizing they have chosen a philosophy.

School 1: Configure at Boot (Config Management)

Launch a bare OS image. On first boot, Ansible/Chef/Puppet runs and installs everything. Every instance converges to the desired state.

bare OS image → boot → Ansible runs → packages installed → configs written → ready
                                    (5–15 minutes)

School 2: Bake Everything In (Golden Images)

Install everything into the image at build time. Launch it. It is ready instantly.

Packer builds image → packages + configs baked in → launch → ready
                                                    (30 seconds)

| Factor | Config at Boot | Golden Image |
|---|---|---|
| Launch speed | 5–15 minutes | 30 seconds |
| Drift risk | High (convergence can fail) | Low (image is immutable) |
| Debugging | Check Ansible logs on every instance | Check one build log |
| Rollback | Re-run old playbook (hope it works) | Launch old AMI |
| Cost | Compute time on every boot | Build time once |

Mental Model: Think of it like restaurants. Config management is cooking each meal to order — flexible, but slow and error-prone at scale. Golden images are meal-prepping on Sunday — fast to serve, consistent every time, but you have to rebuild the whole batch when the recipe changes.

The real world uses both: bake the base, configure the specifics. Packer builds an image with the OS, packages, agents, and hardening. Cloud-init injects secrets, hostnames, and endpoints at boot.
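To make "configure the specifics" concrete, here is a minimal cloud-init user-data sketch for the boot-time half of the split. The hostnames, endpoints, and paths are illustrative, not from this lesson's app:

```yaml
#cloud-config
# Last-mile, per-instance configuration injected at boot — never baked into the image.
hostname: app-prod-01
write_files:
  - path: /etc/app/runtime.env
    permissions: "0600"
    content: |
      DB_ENDPOINT=db.prod.internal:5432
      ENVIRONMENT=prod
runcmd:
  - systemctl restart app
```

Everything in this file is exactly what should NOT be in the Packer template: environment-specific settings that would turn a golden image into a per-environment snowflake.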

Remember: The bake-vs-boot mnemonic is BOSS — Binaries bake, OS config bake, Secrets boot, Settings (env-specific) boot.


What Is Packer?

Packer builds identical machine images for multiple platforms from a single source configuration. You define what the image should contain once. Packer produces an AMI, a Docker image, a Vagrant box — whatever you need — from that definition. It is not a configuration management tool. It does not run on live servers. It runs once, produces an artifact, and exits.

Name Origin: Packer was created by Mitchell Hashimoto (HashiCorp co-founder) and released in July 2013. It was one of HashiCorp's earliest tools, predating Terraform (2014). Before Packer, teams built images by booting a VM, manually installing software, and snapshotting — a non-reproducible process called "golden image by hand."

The Four Building Blocks

| Concept | Role | Analogy |
|---|---|---|
| Template | HCL2 file(s) defining the entire build | The recipe |
| Builder | Plugin that creates the image for a platform (AWS, Docker, QEMU) | The kitchen |
| Provisioner | Runs inside the build to install/configure (shell, Ansible, file) | The chef |
| Post-processor | Acts on the finished artifact (manifest, push, compress) | The packaging line |

Builders create the blank canvas. Provisioners paint on it. Post-processors ship it.


Your First Packer Template

Before theory gets heavy, let's look at a real template. This builds an Ubuntu AMI on AWS with Nginx installed.

# ubuntu-nginx.pkr.hcl

packer {
  required_plugins {
    amazon = {
      version = ">= 1.3.0"
      source  = "github.com/hashicorp/amazon"
    }
  }
}

variable "aws_region" {
  type    = string
  default = "us-east-1"
}

source "amazon-ebs" "ubuntu" {
  ami_name      = "ubuntu-nginx-{{timestamp}}"
  instance_type = "t3.micro"
  region        = var.aws_region

  source_ami_filter {
    filters = {
      name                = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    }
    owners      = ["099720109477"]  # Canonical's AWS account ID
    most_recent = true
  }

  ssh_username = "ubuntu"

  tags = {
    Name       = "ubuntu-nginx"
    Built-By   = "packer"
    Git-SHA    = "{{env `GIT_SHA`}}"
    Build-Time = "{{timestamp}}"
  }
}

build {
  sources = ["source.amazon-ebs.ubuntu"]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx=1.18.0-6ubuntu14.4",
      "sudo systemctl enable nginx"
    ]
  }

  post-processor "manifest" {
    output     = "manifest.json"
    strip_path = true
  }
}

Let's break down what each piece does:

| Block | What it does |
|---|---|
| `packer { required_plugins }` | Declares plugin dependencies — `packer init` downloads them |
| `variable "aws_region"` | Parameterizes the region so you can override it per environment |
| `source "amazon-ebs" "ubuntu"` | Configures the builder: launch a t3.micro, find the latest Ubuntu 22.04 AMI, SSH in as `ubuntu` |
| `source_ami_filter` | Finds the latest official Canonical AMI dynamically instead of hardcoding an AMI ID |
| `build { sources }` | References which source(s) to build |
| `provisioner "shell"` | Runs commands inside the build instance — installs and enables Nginx |
| `post-processor "manifest"` | Writes the resulting AMI ID to `manifest.json` for downstream tools |
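For reference, the manifest post-processor's output looks roughly like this, trimmed to the fields most pipelines read (the IDs below are illustrative placeholders):

```json
{
  "builds": [
    {
      "name": "ubuntu",
      "builder_type": "amazon-ebs",
      "artifact_id": "us-east-1:ami-0abc1234def567890",
      "custom_data": null
    }
  ],
  "last_run_uuid": "f2a9c1d4-0000-0000-0000-000000000000"
}
```

A downstream job can extract the AMI ID with something like `jq -r '.builds[-1].artifact_id' manifest.json | cut -d: -f2`.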

Run It

# Download plugins
packer init .

# Check for syntax errors (2 seconds vs finding out 10 minutes into a build)
packer validate .

# Build the image
packer build .

Gotcha: packer validate catches syntax errors. It does NOT catch runtime errors like "this AMI doesn't exist" or "your AWS credentials are expired." Those fail during packer build. Always validate, but don't treat a passing validate as a guarantee.


Flashcard Check #1

Cover the answers. Test yourself.

| Question | Answer |
|---|---|
| What are Packer's four building blocks? | Template, Builder, Provisioner, Post-processor |
| What's the difference between a builder and a provisioner? | Builder creates the blank image for a platform; provisioner installs software inside it |
| Why use `source_ami_filter` instead of hardcoding an AMI ID? | The latest AMI ID changes with every Canonical release; the filter always finds the newest one |
| What does `packer init` do? | Downloads plugins declared in `required_plugins` blocks |
| Which template format is recommended: JSON or HCL2? | HCL2 — JSON templates are legacy (pre-2020) and lack variables, locals, and functions |

The Full Pipeline: Packer + Ansible + Testing

A shell provisioner works for simple cases. For real infrastructure, you want Ansible running your hardening playbook, your monitoring agent role, and your application setup — the same playbook you'd use anywhere, just executed inside a Packer build.

# golden-ami.pkr.hcl — production template with Ansible + cleanup

packer {
  required_plugins {
    amazon  = { version = ">= 1.3.0", source = "github.com/hashicorp/amazon" }
    ansible = { version = ">= 1.1.0", source = "github.com/hashicorp/ansible" }
  }
}

variable "app_version" { type = string }

variable "env" {
  type    = string
  default = "dev"
}

locals {
  ami_name = "app-${var.app_version}-${formatdate("YYYYMMDD-hhmm", timestamp())}"
}

source "amazon-ebs" "app" {
  ami_name      = local.ami_name
  instance_type = "t3.medium"
  region        = "us-east-1"
  source_ami_filter {
    filters = { name = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" }
    owners      = ["099720109477"]
    most_recent = true
  }
  ssh_username = "ubuntu"
  ami_regions  = ["us-east-1", "us-west-2", "eu-west-1"]
  tags = {
    Name        = local.ami_name
    App-Version = var.app_version
    Git-SHA     = "{{env `GIT_SHA`}}"
    Built-By    = "packer"
    Base-OS     = "ubuntu-22.04"
  }
}

build {
  sources = ["source.amazon-ebs.app"]

  provisioner "file" {
    source      = "files/cloud-init-defaults.yaml"
    destination = "/tmp/cloud-init-defaults.yaml"
  }

  provisioner "ansible" {
    playbook_file   = "ansible/site.yml"
    extra_arguments = ["--extra-vars", "app_version=${var.app_version} env=${var.env}", "-v"]
    ansible_env_vars = ["ANSIBLE_HOST_KEY_CHECKING=False"]
  }

  # Security cleanup — never skip this
  provisioner "shell" {
    inline = [
      "sudo rm -f /home/*/.ssh/authorized_keys",
      "sudo rm -f /root/.ssh/authorized_keys",
      "sudo truncate -s 0 /etc/machine-id",
      "sudo rm -rf /tmp/* /var/tmp/*",
      "history -c"
    ]
  }

  post-processor "manifest" {
    output     = "manifest.json"
    strip_path = true
  }
}

Under the Hood: When you use the ansible provisioner, Packer generates a temporary SSH keypair, boots the build instance, and passes the SSH connection details to Ansible. Ansible runs against the build instance exactly like any other target — same playbook, same roles, same variables. The only difference: this instance will be snapshotted into an image and destroyed.

Why the Cleanup Step Matters

That cleanup provisioner is not optional. Without it: Packer's temporary SSH key stays in authorized_keys (anyone with it can SSH into every launched instance), /etc/machine-id is identical across all instances (breaks DHCP and journal logging), and shell history leaks build commands.

Gotcha: Packer creates a temporary security group with port 22 open to 0.0.0.0/0. If the build fails and cleanup doesn't run, that security group lingers. Always use -on-error=cleanup (the default) in CI.


War Story: The AMI That Worked in Dev

War Story: A fintech team baked database credentials into their golden AMI. Months later, they rotated the password. Every new instance launched from the old AMI silently connected with stale credentials — intermittent auth failures that took three days to trace. Fix: 10 minutes. Finding the cause: 72 hours and a sev-2 incident. The rule: images contain software and configuration, never credentials.

Violating the bake-vs-boot boundary doesn't fail immediately. It creates a time bomb that detonates on credential rotation, key expiration, or certificate renewal.


Parallel Builds: One Template, Multiple Platforms

Here is where Packer's design really shines. You have one team that runs on AWS, another that uses Docker for local development, and a third that uses Vagrant for testing. Three platforms, one source of truth.

# multi-platform.pkr.hcl — three sources, one build block

variable "build_tag" {
  type    = string
  default = "dev"
}

source "amazon-ebs" "app" {
  ami_name      = "app-{{timestamp}}"
  instance_type = "t3.micro"
  region        = "us-east-1"
  source_ami_filter {
    filters = { name = "ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*" }
    owners      = ["099720109477"]
    most_recent = true
  }
  ssh_username = "ubuntu"
}

source "docker" "app" {
  image  = "ubuntu:22.04"
  commit = true
}

source "vagrant" "app" {
  source_path  = "ubuntu/jammy64"
  provider     = "virtualbox"
  communicator = "ssh"
}

build {
  sources = [
    "source.amazon-ebs.app",
    "source.docker.app",
    "source.vagrant.app"
  ]

  provisioner "shell" {
    inline = [
      "sudo apt-get update",
      "sudo apt-get install -y nginx=1.18.0-6ubuntu14.4 curl jq",
      "sudo systemctl enable nginx || true"  # Docker has no systemd
    ]
  }

  provisioner "ansible" {
    playbook_file = "ansible/app.yml"
  }

  # Docker-specific post-processors use "only" to target one source
  post-processor "docker-tag" {
    repository = "registry.internal/app"
    tags       = ["latest", var.build_tag]
    only       = ["docker.app"]
  }

  post-processor "manifest" {
    output     = "manifest.json"
    strip_path = true
  }
}

Build all three at once:

packer build .

Build just the Docker image:

packer build -only='docker.app' .

Build just the AMI:

packer build -only='amazon-ebs.app' .

Trivia: Packer executes multi-platform builds in parallel by default. If you define three sources, Packer launches three builds simultaneously. This was unusual for infrastructure tools in 2013. The build-once-deploy-many model means the slow build happens once (15–30 minutes for an AMI) and then hundreds of servers launch from the pre-baked image in seconds.

Packer Docker vs Dockerfile

Most Docker images should use a Dockerfile — it has layer caching and the ecosystem expects it. Packer's Docker builder exists for one specific case: when you need the same provisioning to produce both a VM image and a container image from one template. If you only need a container, use a Dockerfile.
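For contrast, a plain Dockerfile equivalent of this lesson's shell provisioner would look like the sketch below — the usual choice when only a container image is needed:

```dockerfile
# Hedged sketch: Dockerfile counterpart to the nginx shell provisioner above.
FROM ubuntu:22.04
RUN apt-get update \
 && apt-get install -y --no-install-recommends nginx=1.18.0-6ubuntu14.4 \
 && rm -rf /var/lib/apt/lists/*
# Containers don't use systemd; run nginx in the foreground instead.
CMD ["nginx", "-g", "daemon off;"]
```

Each `RUN` line becomes a cached layer, which is exactly the rebuild speed advantage Packer's Docker builder gives up.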


Flashcard Check #2

| Question | Answer |
|---|---|
| Why should you never bake secrets into a machine image? | Images are copied, shared, and stored — secrets persist in every copy forever |
| What does `packer build -only='docker.app' .` do? | Builds only the Docker source, skipping other sources in the template |
| What's the cleanup provisioner for? | Removes temporary SSH keys, clears machine-id, deletes history — prevents security issues |

Variables: No More Hardcoded Values

You already saw variables in the full template above. Here's the quick reference for how to set them — three ways, in order of precedence (highest wins):

# 1. Command-line flag (highest precedence)
packer build -var 'aws_region=us-west-2' -var 'app_version=2.1.0' .

# 2. Variable file
packer build -var-file=prod.pkrvars.hcl .

# 3. Environment variable (prefix with PKR_VAR_)
export PKR_VAR_aws_region=us-west-2
packer build .

Variable files are HCL: aws_region = "us-east-1", one per line. Use locals {} blocks for computed values that appear in multiple places (like AMI names with timestamps).
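A variable file for this lesson's templates might look like the following sketch (values are illustrative):

```hcl
# prod.pkrvars.hcl — one assignment per line, plain HCL
aws_region  = "us-east-1"
app_version = "2.1.0"
env         = "prod"
```

Loaded with `packer build -var-file=prod.pkrvars.hcl .`; a `-var` flag on the command line would still override any value set here.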

Gotcha: timestamp() returns a different value every time Packer runs. If you need reproducible names for testing, pass the timestamp as a variable instead.


The Immutable Infrastructure Pattern

Packer was built on a philosophy: servers should never be modified after deployment. If you need a change, build a new image and replace the old instances.

Traditional (mutable):
  server → patch → patch → patch → drift → mystery config → 3am incident

Immutable:
  image v1 → deploy → works
  change needed → image v2 → deploy → works
  rollback needed → image v1 → deploy → works

Trivia: The "immutable infrastructure" concept was championed by Chad Fowler in a 2013 blog post, the same year Packer was released. Netflix was one of Packer's earliest prominent users, using it to bake AMIs for their entire fleet — an approach that became an industry best practice.

How Immutable Infrastructure Connects to CI/CD

The image pipeline looks like this:

code change → CI triggers Packer build → image created
  → automated tests (boot, smoke, compliance)
  → promote to production account
  → Terraform deploys new instances from promoted image
  → old instances drained and terminated

Packer owns the image. Terraform owns the infrastructure. Clean boundary.
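That boundary can be sketched from the Terraform side: Terraform looks up the latest promoted image by tag rather than hardcoding an AMI ID. The tag names and the `approved` value are assumptions — match whatever your promotion step actually sets:

```hcl
# Hedged sketch: Terraform consuming the newest Packer-built, approved AMI.
data "aws_ami" "app" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "tag:Built-By"
    values = ["packer"]
  }
  filter {
    name   = "tag:Status"
    values = ["approved"]   # set by the promotion job after tests pass
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.app.id
  instance_type = "t3.medium"
}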


Testing Images Before Production

Building an image without testing it is like shipping code without running tests. The image boots fine in your mental model — but does it actually start Nginx? Does the health check pass? Is the right version of Python installed?

Goss — Lightweight Image Testing

Goss is a YAML-based server validation tool. Write what you expect, run it, get a pass/fail.

# goss/goss.yaml — declare what "correct" looks like
package:
  nginx: { installed: true, versions: ["1.18.0"] }
service:
  nginx: { enabled: true }
port:
  tcp:80: { listening: true }
file:
  /etc/app/config.yaml: { exists: true }

Run it as the last provisioner — if Goss exits non-zero, the build fails:

provisioner "file" {
  source      = "goss/goss.yaml"
  destination = "/tmp/goss.yaml"
}
provisioner "shell" {
  inline = [
    "curl -fsSL https://goss.rocks/install | sudo GOSS_VER=v0.4.4 sh",
    "goss -g /tmp/goss.yaml validate --retry-timeout 30s"
  ]
}

InSpec — Compliance-Focused Testing

For compliance (CIS benchmarks, STIG, PCI-DSS), InSpec runs policy-as-code against the image. The CI pattern: Packer builds, a test job boots an instance, runs InSpec, tears down, and only promotes the AMI if all checks pass.
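A minimal InSpec control for this lesson's image might look like the sketch below — the control ID and title are assumptions, and a real compliance profile would be far larger:

```ruby
# Hedged sketch of an InSpec control mirroring the Goss checks above.
control "nginx-baseline" do
  impact 1.0
  title "Nginx is installed, enabled, and listening"

  describe package("nginx") do
    it { should be_installed }
  end
  describe service("nginx") do
    it { should be_enabled }
  end
  describe port(80) do
    it { should be_listening }
  end
end
```

Run against a booted test instance with `inspec exec <profile> -t ssh://...`; a non-zero exit blocks promotion.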

Packer build → deploy test instance → run Goss/InSpec → terminate
  → pass? → copy AMI to prod account, tag as "approved"
  → fail? → alert team, do not promote

Interview Tip: "How do you ensure your AMIs are secure and up to date?" Strong answer: Packer pipeline builds weekly from the latest base AMI, runs CIS hardening via Ansible, validates with InSpec/Goss tests, promotes through dev → staging → prod accounts. Old AMIs are deregistered after 90 days.


Image Pipeline in CI/CD

The CI workflow follows this pattern: push to packer/** triggers a build, or a weekly cron rebuilds to pick up base image patches.

# .github/workflows/build-ami.yml (key steps)
on:
  push:
    paths: ["packer/**", "ansible/**"]
  schedule:
    - cron: "0 6 * * 1"  # Weekly Monday 6am UTC

jobs:
  build:
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-packer@main
      - run: packer init packer/
      - run: packer fmt -check packer/ && packer validate packer/
      - run: packer build -var "app_version=${{ github.sha }}" packer/
        env:
          GIT_SHA: ${{ github.sha }}

Use OIDC federation (aws-actions/configure-aws-credentials with role-to-assume) instead of long-lived AWS access keys in CI secrets.
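A sketch of the OIDC step, with the role ARN as a placeholder for your account's value:

```yaml
# Hedged sketch — replaces static AWS keys in CI secrets.
permissions:
  id-token: write   # required for the OIDC token exchange
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/packer-build  # placeholder ARN
      aws-region: us-east-1
```

The role's trust policy must allow the GitHub OIDC provider, scoped to your repository.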

Remember: Tag every AMI with: git SHA, build timestamp, base OS version, and builder identity. Mnemonic: STOB — SHA, Time, OS, Builder. When an instance misbehaves in production, these tags let you trace it back to the exact commit that built the image.


Flashcard Check #3

| Question | Answer |
|---|---|
| What is the immutable infrastructure pattern? | Never modify servers after deployment — build a new image and replace |
| Name two image testing tools | Goss (lightweight server validation), InSpec (compliance-focused policy-as-code) |
| What does the STOB mnemonic stand for? | SHA, Time, OS, Builder — the four tags every golden image needs |

Debugging Failed Builds

Packer builds fail. The instance is terminated. You're staring at "provisioner shell returned non-zero exit status" and no way to inspect the machine. Here's the toolkit:

| Technique | Command |
|---|---|
| Pause on failure | `packer build -on-error=ask .` |
| Keep instance alive | `packer build -on-error=abort .` |
| Full debug logging | `PACKER_LOG=1 packer build .` |
| Step-by-step | `packer build -debug .` |

When Packer hangs at "Timeout waiting for SSH," the three most common causes are: (1) firewall blocks port 22, (2) ssh_username doesn't match the AMI default (Ubuntu: ubuntu, Amazon Linux: ec2-user), (3) no IP address (private subnet, DHCP failure).
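Causes (2) and (3) are fixed in the source block. A sketch of the SSH-related settings, assuming the amazon-ebs builder (option names follow that builder's conventions):

```hcl
# Hedged sketch: SSH settings that resolve most "Timeout waiting for SSH" hangs.
source "amazon-ebs" "example" {
  # ... builder config elided ...
  ssh_username                = "ubuntu"  # "ec2-user" for Amazon Linux
  ssh_timeout                 = "10m"     # raise for slow-booting base images
  associate_public_ip_address = true      # needed unless the build subnet is otherwise reachable
}
```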


Provisioners and Post-Processors: Quick Reference

Provisioners run in order. If one fails, the build fails.

| Provisioner | What it does | Key gotcha |
|---|---|---|
| `shell` | Runs commands or scripts — the workhorse | Non-login shell: no `.bashrc`; must use `-y` on apt |
| `file` | Copies files from host into image | Copies as the SSH user, not root — use `/tmp/`, then `sudo mv` |
| `ansible` | Runs a playbook against the build instance | Packer generates a temporary SSH key and passes the connection to Ansible |
| `powershell` | Shell equivalent for Windows builds | Different escaping rules than bash |

Recommended order: file (copy configs in) → shell (install packages) → ansible (configure) → shell (run tests) → shell (cleanup SSH keys, machine-id, history).

Post-processors act on the finished artifact:

| Post-processor | What it does |
|---|---|
| `manifest` | Writes AMI ID, builder name, and timestamp to JSON — essential for CI/CD |
| `docker-tag` / `docker-push` | Tags and pushes Docker images to a registry |
| `vagrant` | Packages the artifact as a `.box` file |
| `checksum` | Generates a SHA256 checksum of the output |

Use the only parameter to limit post-processors to specific sources (e.g., Docker-push only runs on the Docker build).


Packer vs Everything Else

| Tool | Builds | When to use it instead of Packer |
|---|---|---|
| Dockerfile | Container images | Standard app containers — layer caching, faster rebuilds |
| cloud-init | Runtime config (not images) | Per-instance settings at boot — secrets, endpoints, hostnames |
| AWS Image Builder | AMIs | AWS-only shops wanting a managed service |

Mental Model: Packer is the build tool, Terraform is the deploy tool, and cloud-init is the customize tool. Packer builds the image, Terraform launches instances from it, cloud-init injects the last-mile configuration. Mixing up their responsibilities is how drift starts.


Common Pitfalls

Things that will bite you, ranked by how much time they waste:

1. Not Pinning Provisioner Versions

# Three months from now, this installs a different version
apt-get install -y nginx
curl -fsSL https://get.docker.com | sh

Fix: Pin everything. apt-get install -y nginx=1.18.0-6ubuntu14.4. Use checksums on downloaded binaries. Pin Ansible collection versions in requirements.yml.
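Checksum verification can be sketched like this — the file and checksum here are stand-ins for a real download and the vendor's published value:

```shell
# Hedged sketch: verify a downloaded binary against a pinned checksum before installing.
set -euo pipefail
printf 'fake-binary-contents\n' > /tmp/tool       # stands in for: curl -fsSLo /tmp/tool <url>
pinned=$(sha256sum /tmp/tool | awk '{print $1}')  # in real use, hardcode the published checksum
echo "${pinned}  /tmp/tool" | sha256sum -c -      # non-zero exit (mismatch) fails the whole build
```

Because provisioners abort the build on any non-zero exit, a tampered or truncated download can never make it into the image.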

2. Skipping Validate

packer build will fail 8 minutes into a 10-minute build because of a syntax error you could have caught in 2 seconds with `packer validate .`. Always validate. Make it a CI step.

3. Confusing JSON and HCL2

If you find a Packer tutorial using template.json, translate it to HCL2. Use packer hcl2_upgrade to convert, but review the output — the conversion is not perfect.

4. Leaving Orphaned Cloud Resources

Failed builds leave running instances and security groups that accumulate cost silently. Use -on-error=cleanup in CI and run a weekly sweep for resources tagged packer-builder older than 24 hours.


Exercises

Exercise 1: Read a Template (2 minutes)

Look at the first template (ubuntu-nginx.pkr.hcl). Without scrolling back: What instance type? What Ubuntu version? What package? What post-processor?

Answers: `t3.micro`; Ubuntu Jammy 22.04; `nginx=1.18.0-6ubuntu14.4`; `manifest` (writes the AMI ID to JSON).

Exercise 2: Spot the Footguns (5 minutes)

This template has at least four problems. Find them all.

source "amazon-ebs" "web" {
  ami_name      = "web-server"
  instance_type = "t3.micro"
  region        = "us-east-1"
  source_ami    = "ami-0abcdef1234567890"
  ssh_username  = "root"
}

build {
  sources = ["source.amazon-ebs.web"]

  provisioner "shell" {
    inline = [
      "apt-get install nginx",
      "echo 'DB_PASSWORD=hunter2' >> /etc/app/.env",
      "curl -fsSL https://get.docker.com | sh"
    ]
  }
}

Answers:

1. **Hardcoded AMI ID** — use `source_ami_filter` instead
2. **Static `ami_name`** — the second build fails. Add `{{timestamp}}`
3. **`ssh_username = "root"`** — Ubuntu uses `ubuntu`; root SSH is disabled
4. **No `-y` on `apt-get`** — hangs waiting for confirmation
5. **Secret baked into the image** — `DB_PASSWORD=hunter2` persists forever
6. **Unpinned Docker install** — not reproducible
7. **No cleanup provisioner** — SSH keys leak into the image
8. **No post-processor** — no way to track which AMI was produced

Exercise 3: Design a Multi-Platform Template (10 minutes)

Your team needs an image with Python 3.11 and Redis client tools, available as an AWS AMI, a Docker image, and a Vagrant box. Sketch the HCL structure (source blocks, provisioners, post-processors). Which post-processors need the only parameter and why?

Hint: Three `source` blocks, one `build` block referencing all three. `docker-tag` and `docker-push` need `only = ["docker.app"]`. Remember: Docker has no systemd, so `systemctl enable` needs `|| true`.

Cheat Sheet

| Command | What it does |
|---|---|
| `packer init .` | Download plugins |
| `packer fmt -check .` | Check formatting (CI gate) |
| `packer validate .` | Syntax check — always run before build |
| `packer build .` | Execute the build |
| `packer build -only='amazon-ebs.app' .` | Build a single source |
| `packer build -var-file=prod.pkrvars.hcl .` | Load variables from file |
| `packer build -on-error=ask .` | Pause on failure for debugging |
| `PACKER_LOG=1 packer build .` | Full debug logging |

| Mnemonic | Meaning |
|---|---|
| BOSS | Binaries bake, OS config bake, Secrets boot, Settings boot |
| STOB | SHA, Time, OS, Builder — golden image tags |

Takeaways

  • Image drift is a class of incident, not a single bug. It comes from mutable servers and missing image pipelines. Packer eliminates it by making images the only path to production.

  • Bake the base, configure the specifics. Packages, agents, and hardening go in the image. Secrets, endpoints, and per-environment settings go through cloud-init at boot.

  • One template, multiple platforms. Packer's parallel build model means your AWS AMI, Docker image, and Vagrant box come from the same source of truth. Drift between environments dies here.

  • Test your images before they reach production. Goss or InSpec inside the Packer build catches broken images before they launch. A failing test fails the build.

  • Tag everything. SHA, timestamp, OS version, builder identity. When an instance misbehaves at 3 AM, these tags are the difference between a 10-minute fix and a 3-hour investigation.

  • Never bake secrets. This is the footgun that keeps firing. Credentials in images create time bombs that detonate on rotation.