Interview Gauntlet: Container Image Build and Distribution Pipeline
Category: System Design | Difficulty: L2-L3 | Duration: 15-20 minutes | Domains: Containers, Supply Chain Security
Round 1: The Opening
Interviewer: "Design a container image build and distribution pipeline for an organization with 20 services. Walk me through from developer commit to image running in production."
Strong Answer:
"The pipeline has four stages: build, scan, store, and distribute. On a push to main, CI (GitHub Actions, GitLab CI, etc.) triggers a multi-stage Docker build. The Dockerfile uses a builder stage with the full SDK and a runtime stage based on a minimal image like distroless or alpine. The build output is tagged with both the git SHA and a semantic version — something like registry.example.com/service-a:v1.2.3 and registry.example.com/service-a:abc123f. After build, the image goes through a vulnerability scan using Trivy or Grype. If critical or high CVEs are found in the runtime image, the pipeline fails and the developer gets a report. Images that pass scanning are pushed to a container registry — Amazon ECR, Harbor, or Google Artifact Registry. The registry is organized per-service with lifecycle policies that prune untagged images older than 30 days. For distribution to production, the deployment system (Argo CD, Flux, or Helm) references images by digest (sha256:...) not just tag, so we have an immutable reference. Each environment (dev, staging, prod) has its own registry or the same registry with access policies that restrict which images can be pulled into production."
Common Weak Answers:
- "We build with
docker buildand push to Docker Hub." — No mention of scanning, no base image strategy, no digest pinning. This is a hobby project workflow, not a production pipeline. - Skipping the vulnerability scan — In 2024+, shipping unscanned images to production is a non-starter for any organization that cares about security.
- Using
:latesttag — This is the most common source of "it works in staging but not in prod" container issues. No one should deploy:latestto production.
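To make the :latest problem concrete, the deployment manifest should carry both a human-readable tag and the digest the kubelet actually resolves. A hypothetical Kubernetes fragment (the digest value is illustrative):

```yaml
# Tag is for humans; the digest is what gets pulled, so the running
# image can never silently change underneath a deployment.
spec:
  containers:
    - name: service-a
      image: registry.example.com/service-a:v1.2.3@sha256:4f53cda18c2baa0c0354bb5f9a3ecbe5ed12ab4d8e11ba873c2f11161202b945
```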
Round 2: The Probe
Interviewer: "Tell me about your base image strategy. How do you keep 20 services' base images consistent and up to date?"
What the interviewer is testing: Whether the candidate has dealt with base image sprawl — the situation where 20 services use 15 different base images at varying patch levels.
Strong Answer:
"I'd create an internal base image catalog — 3 to 5 golden base images maintained by the platform team. For example: base-python:3.11, base-node:20, base-go:1.22, and base-jvm:21. These are built weekly from the upstream images with our organization's standard CA certificates, security patches, and common utilities baked in. They're stored in an internal registry and signed using cosign (from the Sigstore project). Each service's Dockerfile starts FROM registry.internal/base-python:3.11 instead of FROM python:3.11-slim. When the base image is rebuilt, a Renovate bot or Dependabot-like process opens PRs against all 20 service repos to update the base image digest. The PR triggers the full CI pipeline including tests, so we know the new base doesn't break anything. Services that are pinned to older base images show up on a dashboard with their CVE exposure. The key discipline is: no service should use a base image older than 30 days. If a service's base image PR has been open for more than a week, it escalates to the tech lead."
Trap Alert:
If the candidate bluffs here: The interviewer will ask "How do you sign images with cosign — what does the signature verify?" The honest answer if you haven't used it: "I know cosign uses keyless signing with Fulcio for identity-based signatures, but I haven't implemented it personally. I'd need to work through the Sigstore documentation for the specific workflow." Bluffing about cryptographic signing details is easily caught.
Round 3: The Constraint
Interviewer: "A government customer requires deploying to an air-gapped environment. No internet access whatsoever — no Docker Hub, no external registries, no package managers. How do you get images into this environment?"
Strong Answer:
"Air-gapped deployment requires a completely offline pipeline. Here's how I'd structure it. First, the build: all images are built in our internet-connected CI environment as normal. We then export the final images as OCI tarballs using crane export or skopeo copy --all to a local directory. These tarballs, along with a SHA256 manifest, are written to a transfer medium — could be a USB drive, a physical disk, or a one-way diode network depending on the classification level. On the air-gapped side, we run a private registry (Harbor is a good choice because it includes scanning and replication features). The tarballs are imported into Harbor using skopeo copy from the directory or crane push. All the image references in our Helm charts or Kubernetes manifests are rewritten to point to the internal Harbor registry — I'd use Kustomize image overrides for this: kustomize edit set image registry.example.com/service-a=harbor.airgap.local/service-a. For base images and OS packages, the same process applies — we'd mirror the specific versions we need. The tricky part is dependency management: any library or package that gets fetched during build needs to be vendored or included in the builder image. For Python, that means running pip download in the connected environment and including the wheel files. For Node, it's npm pack with a pre-populated cache."
The Senior Signal:
What separates a senior answer: Mentioning the base image and build-time dependency problem, not just the runtime image transfer. Many candidates describe how to move the final image but forget that the build process itself needs internet access for package installation. Vendoring dependencies or using a multi-stage build where the builder stage includes all dependencies is the key insight. Also: knowing that skopeo handles OCI images without needing a Docker daemon, which matters in CI environments and security-conscious deployments.
Round 4: The Curveball
Interviewer: "Six months into production, you discover that a base image you've been using for 3 months contained a backdoored dependency — a supply chain attack similar to the xz-utils incident. How do you respond and how do you prevent this in the future?"
Strong Answer:
"Immediate response: identify every image built on the compromised base image in the last 3 months. This is where the SBOM (Software Bill of Materials) is critical — if we've been generating SBOMs with Syft or Trivy during build and storing them alongside the images, I can query for every image that includes the compromised package version. For each affected image, I need to determine: was the backdoor exploitable in our runtime context? What access did the compromised service have? This is blast radius assessment. Then: rebuild every affected image from a clean base, deploy the patched versions, and rotate any credentials that the compromised services had access to — because we have to assume they were exfiltrated. For prevention: this is where image provenance matters. I'd implement SLSA (Supply-chain Levels for Software Artifacts) level 2 or higher. That means: builds happen on a hardened CI system with audit logs, every build produces an attestation signed by the CI system's identity (using Sigstore), and the admission controller in Kubernetes (Kyverno or Sigstore's policy-controller) verifies the attestation before allowing a pod to run. For the base images specifically, I'd pin to specific digests (not tags), build from a known-good upstream source, and audit the full dependency tree with syft on every base image rebuild."
Trap Question Variant:
The right answer acknowledges the limits of detection. Candidates who say "our scanning would have caught it" are likely wrong — the xz-utils backdoor was designed to evade automated scanning and was found by a human noticing a performance regression. The honest answer: "Vulnerability scanners catch known CVEs but not novel supply chain attacks. Detection of this class of attack requires multiple layers: reproducible builds, SBOM diffing between releases, runtime behavior monitoring, and sometimes just luck. No single tool prevents this."
Round 5: The Synthesis
Interviewer: "Build speed vs security. Your developers complain that the image pipeline adds 8 minutes to every build — scanning, signing, SBOM generation. They want to skip it for dev environments. How do you handle this?"
Strong Answer:
"I'd agree to a tiered approach but with guardrails. Dev and feature branch builds: skip the full vulnerability scan and SBOM generation, but keep the multi-stage build and basic lint. This gets builds down to 2-3 minutes. Images go to a dev registry namespace and are tagged as unverified. Staging and main branch builds: full pipeline — scan, sign, SBOM, attestation. This is the gate. The key is that the Kubernetes admission controller in staging and production only allows images from the verified registry namespace with valid signatures. So even if someone tries to deploy a dev-pipeline image to production, the admission controller blocks it. But I'd also invest in making the full pipeline faster: layer caching with BuildKit, parallel scanning (Trivy can scan while the image is being pushed), and incremental SBOM generation that only re-analyzes changed layers. Most of that 8 minutes is solvable with engineering effort. The conversation with developers isn't really about security vs speed — it's about feedback loop optimization. If the full pipeline took 2 minutes, nobody would ask to skip it. So I'd set a target: full pipeline under 3 minutes, and invest the engineering effort to get there."
What This Sequence Tested:
| Round | Skill Tested |
|---|---|
| 1 | Container image pipeline architecture fundamentals |
| 2 | Base image management and organizational governance |
| 3 | Air-gapped deployment and offline distribution |
| 4 | Supply chain security incident response and prevention |
| 5 | Balancing developer experience with security requirements |