Investigation: CI Pipeline Fails, Docker Layer Cache Corruption, Fix Is Registry GC¶
Phase 1: DevOps Tooling Investigation (Dead End)¶
Check the CI pipeline configuration:
# .github/workflows/ci.yml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: registry.internal:5000/app:${{ github.sha }}
    cache-from: type=registry,ref=registry.internal:5000/app:cache
    cache-to: type=registry,ref=registry.internal:5000/app:cache,mode=max
Configuration looks correct. Check the Dockerfile:
# No recent changes to Dockerfile
$ git log --pretty='%h %ar %s' -5 -- Dockerfile
a1b2c3d 6 days ago chore: bump python base to 3.11.8
Try to pull the cache manifest manually:
$ docker manifest inspect registry.internal:5000/app:cache
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.buildkit.cacheconfig.v0",
      "digest": "sha256:a8f2b...",
      "size": 4821
    }
  ]
}
# Try to pull the specific digest
$ docker pull registry.internal:5000/app@sha256:a8f2b...
Error response from daemon: manifest for registry.internal:5000/app@sha256:a8f2b... not found
The cache manifest references a digest the registry cannot serve: the manifest metadata exists, but the layer blob behind it is missing.
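The manual check above generalizes: walk every digest a manifest references and confirm the blob exists in the store. A minimal sketch, assuming the standard registry filesystem layout `blobs/sha256/<first two hex chars>/<full digest>/data`; the helper name is hypothetical:

```python
import json
from pathlib import Path


def missing_blobs(manifest: dict, blob_root: Path) -> list:
    """Return digests the manifest references but the blob store lacks.

    Assumes the registry's on-disk layout:
    <blob_root>/sha256/<first two hex chars>/<full hex digest>/data
    """
    missing = []
    for entry in manifest.get("manifests", []):
        algo, hexval = entry["digest"].split(":", 1)
        data_file = blob_root / algo / hexval[:2] / hexval / "data"
        if not data_file.is_file():
            missing.append(entry["digest"])
    return missing
```

Running this against the cache manifest would flag `sha256:a8f2b...` immediately, without a `docker pull` round-trip per digest.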
The Pivot¶
Why would a blob exist in the manifest but not in storage? Check the registry's storage backend:
# Harbor registry runs on Kubernetes
$ kubectl exec -it harbor-registry-0 -n harbor -- du -sh /storage/
47G /storage/
$ kubectl exec -it harbor-registry-0 -n harbor -- ls /storage/docker/registry/v2/blobs/sha256/a8/a8f2b*
ls: cannot access '/storage/docker/registry/v2/blobs/sha256/a8/a8f2b*': No such file or directory
The blob sha256:a8f2b... is referenced by the cache manifest but does not exist on disk. Check the volume's capacity:
$ kubectl exec -it harbor-registry-0 -n harbor -- df -h /storage
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 50G 47G 2.1G 96% /storage
The storage volume is 96% full. Check if garbage collection has been running:
$ kubectl logs -n harbor deploy/harbor-jobservice --since=168h | grep -iE "garbage|gc"
2026-03-16T02:00:00Z [INFO] Starting garbage collection
2026-03-16T02:00:45Z [ERROR] GC failed: filesystem is almost full, refusing to operate
2026-03-16T02:00:45Z [INFO] GC completed with errors
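The refusal is a sensible safety valve: GC presumably needs working space of its own before it can free anything. A minimal preflight sketch of that behavior; the 5% threshold is an assumption for illustration, not Harbor's actual policy:

```python
import shutil


def gc_preflight(path: str, min_free_fraction: float = 0.05) -> bool:
    """Return True only if the filesystem has enough free space to run GC.

    The 5% cutoff is an assumed value, mirroring the "refusing to
    operate" log line above; the real threshold may differ.
    """
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction
```

At 96% used, a check like this fails every night, so the orphaned layers never get reclaimed and the volume only fills further.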
Phase 2: Linux Ops Investigation (Root Cause)¶
The registry's persistent volume is almost full. Garbage collection has been failing for at least 3 days. Without GC, orphaned layers accumulate. But the real problem is worse — the filesystem filled up during a push operation, causing a partial write:
$ kubectl logs -n harbor harbor-registry-0 --since=72h | grep -i "error\|write\|disk"
2026-03-17T14:22:31Z [ERROR] blob upload failed: write /storage/docker/registry/v2/blobs/sha256/a8/a8f2b.../data: no space left on device
2026-03-17T14:22:31Z [WARN] partial blob written, metadata committed but data incomplete
Three days ago, a cache push partially succeeded — the manifest was updated to reference sha256:a8f2b... but the actual blob write failed due to disk full. The manifest now points to a non-existent blob. Every build that tries to use this cache hits the missing blob and fails.
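The failure mode is a classic ordering bug: metadata was committed before the data it describes was durable. A hedged sketch of the safer ordering, writing the blob to a temp file, fsyncing, renaming atomically, and only then updating the manifest; the names and layout are illustrative, not Harbor's implementation:

```python
import json
import os
import tempfile
from pathlib import Path


def commit_blob_then_manifest(blob_root: Path, manifest_path: Path,
                              digest: str, payload: bytes) -> None:
    """Make the blob durable before the manifest ever references it.

    If the data write fails (e.g. ENOSPC), the manifest is untouched,
    so a dangling reference can never be created. Illustrative layout.
    """
    hexval = digest.split(":", 1)[1]
    blob_dir = blob_root / "sha256" / hexval[:2] / hexval
    blob_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=blob_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())            # data on disk, not just in page cache
        os.replace(tmp, blob_dir / "data")  # atomic: never a partial "data" file
    except OSError:
        os.unlink(tmp)                      # roll back; manifest never touched
        raise
    # Only now record the reference.
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"manifests": []}
    manifest["manifests"].append({"digest": digest, "size": len(payload)})
    manifest_path.write_text(json.dumps(manifest))
```

With this ordering, the disk-full push three days ago would have failed cleanly, leaving the previous cache manifest intact and every subsequent build working.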
Check what is consuming the space:
$ kubectl exec -it harbor-registry-0 -n harbor -- \
find /storage -name "data" -size +100M | head -10 | xargs du -sh
12G /storage/docker/registry/v2/blobs/sha256/3c/3c8d.../data
8.4G /storage/docker/registry/v2/blobs/sha256/f1/f192.../data
5.2G /storage/docker/registry/v2/blobs/sha256/b7/b7a4.../data
Old, untagged image layers are consuming tens of gigabytes; garbage collection should have removed them long ago.
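Conceptually, registry GC is a mark-and-sweep over exactly this layout: mark every digest reachable from a manifest, then treat everything else on disk as garbage. A simplified sketch (the real GC also walks repository and tag metadata, which this ignores):

```python
from pathlib import Path


def orphaned_blobs(blob_root: Path, manifests: list) -> set:
    """Blobs on disk that no manifest references (GC sweep candidates)."""
    # Mark phase: every digest reachable from some manifest.
    referenced = {entry["digest"]
                  for m in manifests
                  for entry in m.get("manifests", [])}
    # Sweep phase: everything on disk minus the marked set.
    on_disk = {"sha256:" + data.parent.name
               for data in blob_root.glob("sha256/*/*/data")}
    return on_disk - referenced
```

The 12G and 8.4G blobs above are exactly the kind of unmarked layers a successful sweep would have deleted, which is why the fix is to free enough space for GC to run and then resize the volume.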
Domain Bridge: Why This Crossed Domains¶
Key insight: the symptom was a CI pipeline failure (devops_tooling); the root cause was a corrupted cache reference created by a disk-full condition on the registry's persistent volume (linux_ops); and the fix requires running registry garbage collection and resizing the volume (kubernetes_ops). This pattern is common because container image registries are storage-intensive services: without garbage collection they fill their volumes with orphaned layers, a full disk can leave manifests pointing at blobs that were never written, and CI pipelines break as soon as their cache references become invalid.
Root Cause¶
The Harbor registry's 50GB persistent volume filled to 96%. Garbage collection had been failing for days due to the full filesystem. During a CI cache push, the disk ran out of space mid-write — the cache manifest was updated to reference a new blob, but the blob data was never fully written. All subsequent CI builds failed because the cache manifest pointed to a non-existent layer.