# Pattern: Stale Image Tag

**ID:** FP-031 · **Family:** Silent Corruption · **Frequency:** Common · **Blast Radius:** Single Service to Multi-Service · **Detection Difficulty:** Actively Misleading
## The Shape
A container image tag (especially `latest` or a mutable semantic version) is used in
production. The tag string in the deployment configuration never changes, but the tag
itself is mutable: the image behind it can be replaced in the registry. When a pod
restarts (node reboot, OOMKill, rolling restart), it pulls whatever image currently
sits behind the tag, which may differ from what was originally deployed. Different
pods in the same deployment silently run different code versions. The deployment
configuration hasn't changed and no deploy was triggered, but the behavior has changed.
## How You'll See It

### In Kubernetes
`image: myapp:latest` in the deployment. Three months ago, `latest` was v1.2.3.
Someone pushed a new image to `latest` (v1.4.0) without triggering a deployment. A pod
OOMKilled and was restarted; it pulled v1.4.0. The other 9 pods are still running
v1.2.3. The service is now running two versions simultaneously, and users get
inconsistent behavior depending on which pod handles their request.
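The mixed-fleet state can be spotted mechanically. Below is a minimal sketch that flags pods not running the fleet's majority digest; the pod names and digests are made-up sample data, and the real `kubectl` invocation is shown only as a comment since it needs cluster access:

```shell
# Given "pod-name image-digest" pairs, print the pods that are NOT running the
# most common digest. Pure text processing, so it can be tried without a cluster.
outliers() {
  awk '{ pod[NR] = $1; dig[NR] = $2; cnt[$2]++ }
       END {
         max = 0
         for (d in cnt) if (cnt[d] > max) { max = cnt[d]; best = d }
         for (i = 1; i <= NR; i++) if (dig[i] != best) print pod[i], dig[i]
       }'
}

# Real input would come from something like (needs cluster access):
#   kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name} {.status.containerStatuses[0].imageID}{"\n"}{end}'
printf 'pod-a sha256:aaa\npod-b sha256:aaa\npod-c sha256:bbb\n' | outliers
# -> pod-c sha256:bbb
```

Anything this prints is a pod serving different code from the rest of the deployment.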
### In Linux/Infrastructure
Docker Compose with `image: nginx:stable`. The NGINX team updates the `stable` tag
from 1.24 to 1.25. A host reboots; Docker starts the container and pulls the new
`stable` image. The container is now running 1.25 instead of 1.24. No change was made
to the Compose file.
### In CI/CD
A CI pipeline uses `image: python:3.11` as the build environment. Python publishes a
security patch: `python:3.11` now points to 3.11.8 instead of 3.11.6. The build passes
on the new image; the tests break because 3.11.8 changed a behavior the tests relied
on. The pipeline "changed" without any commit to the repository.
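One common mitigation is to resolve the tag to a digest once, then hand the pinned reference to the rest of the pipeline. A hedged sketch (the digest value is a placeholder, and splitting on the first `:` is a deliberate simplification that breaks on registry hosts with ports):

```shell
# Build a digest-pinned reference from a "name:tag" ref plus a resolved digest.
# NOTE: "${1%%:*}" keeps everything before the FIRST ':', which is wrong for
# refs like registry.example.com:5000/app -- a simplification for illustration.
pin_ref() {
  printf '%s@%s\n' "${1%%:*}" "$2"
}

# In CI you would resolve the current digest once, e.g.
#   docker buildx imagetools inspect python:3.11
# prints the manifest digest, then:
pin_ref 'python:3.11' 'sha256:0123abcd'
# -> python@sha256:0123abcd
```

Every subsequent pipeline run then builds on identical bytes, no matter where the tag moves.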
## The Tell

- Pods in the same deployment are running different image digests despite no deploy.
- `kubectl get pods -o jsonpath='{.items[*].status.containerStatuses[*].imageID}'` shows different SHA256 digests for the same image tag.
- The behavior inconsistency appeared after a pod restart, not after a deployment.
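The digest check reduces to counting distinct values: more than one line of output for a single deployment is the smoking gun. A small sketch (the counting stage is plain text processing, so it can be tried without a cluster; the digests below are sample data):

```shell
# Count how many containers run each digest, most common first.
count_digests() {
  tr ' ' '\n' | sed '/^$/d' | sort | uniq -c | sort -rn
}

# Real usage (needs cluster access):
#   kubectl get pods -o jsonpath='{.items[*].status.containerStatuses[*].imageID}' | count_digests
echo 'sha256:aaa sha256:aaa sha256:bbb' | count_digests
```

A healthy deployment prints exactly one line.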
## Common Misdiagnosis
| Looks Like | But Actually | How to Tell the Difference |
|---|---|---|
| Application bug | Mixed-version fleet | Some pods behave correctly; others don't; no deploy happened |
| Flaky test | Non-reproducible build environment | Build image tag changed; old and new versions have different behavior |
| Deployment rollout | Unintentional version change on restart | No deploy was triggered; pod restart pulled a new image via mutable tag |
## The Fix (Generic)

- **Immediate:** Pin all pods to the same image digest (`image: myapp@sha256:<digest>`); restart all pods to normalize the fleet.
- **Short-term:** Use immutable image tags (semantic version plus build hash: `v1.2.3-abc1234`); never use `latest` or other mutable tags in production.
- **Long-term:** Enforce image digest pinning in admission webhooks; use image policy controllers (OPA, Kyverno) to reject mutable tags in production namespaces.
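The immediate fix is scriptable. A hedged sketch, where the deployment name `myapp` and version `v1.2.3` are hypothetical, the cluster commands are shown as comments, and the digest-extraction helper is plain string handling that runs as-is:

```shell
# Extract the digest part of a pinned reference; fails on unpinned refs.
ref_digest() {
  case "$1" in
    *@sha256:*) printf '%s\n' "${1#*@}" ;;
    *) return 1 ;;
  esac
}

# Fleet normalization, step by step (cluster commands as comments):
# 1. Resolve the digest of the *intended* version:
#      docker buildx imagetools inspect myapp:v1.2.3
# 2. Pin the deployment; this triggers a rollout so every pod converges:
#      kubectl set image deployment/myapp myapp=myapp@sha256:<digest>
# 3. Watch it converge:
#      kubectl rollout status deployment/myapp
ref_digest 'myapp@sha256:0123abcd'
# -> sha256:0123abcd
```

Because the rollout replaces every pod, the fleet ends up on one digest even if some pods had silently drifted.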
## Real-World Examples

- **Example 1:** `image: nginx:latest` in a production DaemonSet. A new NGINX major version was published to `latest`. Nodes rebooted during a kernel patch cycle, so half the nodes ran the old NGINX and half ran the new one. Behavior difference: the new NGINX rejected a header format, and 50% of requests failed with 400.
- **Example 2:** `image: python:3.9` in production. A Python 3.9.18 security patch introduced a change in `datetime` behavior. The pod that restarted after an OOMKill ran 3.9.18; the others still ran 3.9.15. Date serialization was inconsistent across pods.
## War Story

Three days of a mysterious "intermittent" bug: 30% of API requests returned a different response format. No deploy in three days. I checked the deployment: unchanged. I checked the image digests with `kubectl get pods -o json | jq '.items[].status.containerStatuses[].imageID'` — two different SHA256 hashes, split 70/30. Someone had pushed a new version to the `latest` tag for a hotfix four days ago without updating the deployment. Two pods had restarted for unrelated reasons and pulled the new code. We pinned to the digest of the correct version, all pods came up on the same version, and the bug was gone. Now we have a CI check that rejects `latest` or untagged images in production deployments.
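That kind of CI check might look something like the sketch below. The image refs are hypothetical, and matching on the literal ref string is a simplification (it ignores registry hosts that themselves contain `:`):

```shell
# Accept digest-pinned refs and explicit tags; reject ':latest' and untagged
# refs (an untagged ref implicitly means 'latest').
image_ok() {
  case "$1" in
    *@sha256:*) return 0 ;;  # digest-pinned: always fine
    *:latest)   return 1 ;;  # explicit 'latest': reject
    *:*)        return 0 ;;  # some other explicit tag: allowed
    *)          return 1 ;;  # no tag -> implicit 'latest': reject
  esac
}

image_ok 'myapp:v1.2.3-abc1234' && echo 'OK:       myapp:v1.2.3-abc1234'
image_ok 'myapp:latest'         || echo 'REJECTED: myapp:latest'
image_ok 'myapp'                || echo 'REJECTED: myapp'
```

In a pipeline you would run this over every `image:` line extracted from the manifests and fail the build on the first rejection.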
## Cross-References

- Topic Packs: k8s-ops, cicd
- Case Studies: ops-archaeology/07-stale-image-tag/
- Footguns: k8s-ops/footguns.md — "`latest` tag in production"
- Related Patterns: FP-033 (`latest` tag in prod — the configuration landmine that enables this), FP-024 (health check lying — health passes but behavior is wrong)