
Pattern: Stale Image Tag

ID: FP-031 | Family: Silent Corruption | Frequency: Common | Blast Radius: Single Service to Multi-Service | Detection Difficulty: Actively Misleading

The Shape

A container image tag (especially latest or a mutable semantic version) is used in production. The tag string in the deployment configuration never changes, but the image the registry serves for that tag does. When a pod restarts (node reboot, OOMKill, rolling restart), it pulls whatever currently sits behind the tag — which may differ from what was originally deployed. Different pods in the same deployment silently run different code versions. The deployment configuration hasn't changed and no deploy was triggered, but the behavior has.

How You'll See It

In Kubernetes

image: myapp:latest in the deployment. Three months ago, latest was v1.2.3. Someone pushed a new image (v1.4.0) to latest without triggering a deployment. A pod was OOMKilled and restarted; it pulled v1.4.0. The other nine pods still run v1.2.3. The service is now running two versions simultaneously, and users get inconsistent behavior depending on which pod handles their request.
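This failure hinges on Kubernetes' default pull policy: for a :latest tag (or no tag), imagePullPolicy defaults to Always, so every restart re-resolves the tag against the registry. A minimal sketch of the risky spec — the deployment name and labels are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp                 # hypothetical name
spec:
  replicas: 10
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: myapp
          image: myapp:latest
          # imagePullPolicy defaults to Always for :latest, so any
          # pod restart re-pulls whatever the tag points to *now*.
```

Even with imagePullPolicy: IfNotPresent, a pod rescheduled onto a fresh node pulls the current image behind the tag, so the drift is only slower, not prevented.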

In Linux/Infrastructure

Docker Compose with image: nginx:stable. The NGINX team moves the stable tag from 1.24 to 1.25. A host is rebuilt (or its image cache is cleared) and Docker pulls the current stable image, so the container now runs 1.25 instead of 1.24. No change was made to the Compose file.
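Compose supports the same digest-pinning fix as Kubernetes; a sketch, where the digest placeholder stands in for the real digest of the image you actually tested:

```yaml
services:
  web:
    # Pinned by digest: rebuilds, reboots, and fresh hosts always get
    # this exact image, regardless of where the tag moves.
    image: nginx@sha256:<digest-of-the-tested-image>
    # vs. the mutable form that drifts when upstream retags:
    # image: nginx:stable
```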

In CI/CD

CI pipeline uses image: python:3.11 as the build environment. The Python image maintainers publish a security patch: python:3.11 now points to 3.11.8 instead of 3.11.6. The build runs on the new image and tests break because 3.11.8 changed a behavior the tests relied on. The pipeline "changed" without any commit to the repository.
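The same pinning discipline applies to CI builders. A sketch, assuming a GitHub Actions pipeline (job name and tag are illustrative):

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    container:
      # Pin to an exact patch release so the build environment cannot
      # change without a commit; python:3.11 drifts across patch releases.
      # Better still, pin by digest: python:3.11.6-slim@sha256:<digest>
      image: python:3.11.6-slim
```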

The Tell

Pods in the same deployment are running different image digests despite no deploy. kubectl get pods -o jsonpath='{.items[*].status.containerStatuses[*].imageID}' shows different SHA256 digests for the same image tag. Behavior inconsistency appeared after a pod restart, not after a deployment.
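The jsonpath check above can be condensed into a digest histogram: more than one distinct digest for the same tag means a mixed fleet. A minimal sketch against a saved dump — the file path and digests are made up; on a live cluster you would pipe `kubectl get pods -o json` in directly:

```shell
# Hypothetical dump of `kubectl get pods -o json`, trimmed to imageIDs.
cat <<'EOF' > /tmp/pods.json
{"items":[
  {"status":{"containerStatuses":[{"imageID":"docker.io/myapp@sha256:aaa111"}]}},
  {"status":{"containerStatuses":[{"imageID":"docker.io/myapp@sha256:aaa111"}]}},
  {"status":{"containerStatuses":[{"imageID":"docker.io/myapp@sha256:bbb222"}]}}
]}
EOF

# Count pods per image digest; more than one line of output = mixed fleet.
grep -o 'sha256:[0-9a-f]*' /tmp/pods.json | sort | uniq -c
```

The count column also tells you the blast radius: here two pods run one digest and one pod runs another, matching the kind of 70/30 request split this pattern produces.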

Common Misdiagnosis

  • Looks like: application bug. Actually: mixed-version fleet. How to tell: some pods behave correctly and others don't, and no deploy happened.
  • Looks like: flaky test. Actually: non-reproducible build environment. How to tell: the build image tag changed; old and new images behave differently.
  • Looks like: deployment rollout. Actually: unintentional version change on restart. How to tell: no deploy was triggered; a pod restart pulled a new image via a mutable tag.

The Fix (Generic)

  1. Immediate: Pin all pods to the same image digest: image: myapp@sha256:<digest>; restart all pods to normalize the fleet.
  2. Short-term: Use immutable image tags (semantic version + build hash: v1.2.3-abc1234); never use latest or mutable tags in production.
  3. Long-term: Enforce image digest pinning in admission webhooks; use image policy controllers (OPA, Kyverno) to reject mutable tags in production namespaces.
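Step 3 can be enforced with a Kyverno validation policy that rejects any pod image not pinned by digest; a sketch adapted from Kyverno's common require-image-digest pattern (policy name, namespace, and message are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digests       # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-digest
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [production]   # illustrative namespace
      validate:
        message: "Images must be pinned by digest (repo@sha256:...)."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"      # reject tag-only references
```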

Real-World Examples

  • Example 1: image: nginx:latest in a production DaemonSet. A new NGINX major version was published to latest. Nodes rebooted during a kernel patch cycle, so half the nodes pulled the new NGINX while the other half kept the old one. The new NGINX rejected a header format the old one accepted; 50% of requests failed with 400.
  • Example 2: image: python:3.9 in production. The 3.9.18 security release changed a datetime behavior. The pod that restarted after an OOMKill ran 3.9.18 while the others stayed on 3.9.15, so date serialization was inconsistent across pods.

War Story

Three days of a mysterious "intermittent" bug: 30% of API requests returned a different response format, with no deploy in three days. The deployment spec was unchanged. Then I checked the image digests — kubectl get pods -o json | jq '.items[].status.containerStatuses[].imageID' — two different SHA256 hashes, split 70/30. Someone had pushed a hotfix to the latest tag four days earlier without updating the deployment. Two pods had restarted for unrelated reasons and pulled the new code. We pinned the deployment to the digest of the correct version, restarted the fleet so every pod came up on the same image, and the bug was gone. We now have a CI check that rejects latest and untagged images in production deployments.

Cross-References