Investigation: CI Pipeline Fails, Docker Layer Cache Corruption, Fix Is Registry GC

Phase 1: DevOps Tooling Investigation (Dead End)

Check the CI pipeline configuration:

# .github/workflows/ci.yml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: registry.internal:5000/app:${{ github.sha }}
    cache-from: type=registry,ref=registry.internal:5000/app:cache
    cache-to: type=registry,ref=registry.internal:5000/app:cache,mode=max

Configuration looks correct. Check the Dockerfile:

# No recent changes to Dockerfile
$ git log --oneline -5 -- Dockerfile
a1b2c3d  6 days ago  chore: bump python base to 3.11.8
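To rule out the workflow itself, the same build can be reproduced locally with buildx using the identical cache flags (a sketch; the local tag name is an assumption):

```shell
# Reproduce the CI build outside the workflow, with the same registry cache.
# If this fails the same way, the problem is the cache, not the pipeline.
docker buildx build \
  --cache-from type=registry,ref=registry.internal:5000/app:cache \
  --cache-to type=registry,ref=registry.internal:5000/app:cache,mode=max \
  --tag registry.internal:5000/app:local-test \
  .
```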

Try to pull the cache manifest manually:

$ docker manifest inspect registry.internal:5000/app:cache
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.buildkit.cacheconfig.v0",
      "digest": "sha256:a8f2b...",
      "size": 4821
    }
  ]
}

# Try to pull the specific digest
$ docker pull registry.internal:5000/app@sha256:a8f2b...
Error response from daemon: manifest for registry.internal:5000/app@sha256:a8f2b... not found

The cache manifest references a digest that does not exist in the registry. The manifest metadata is there but the actual layer blob is missing.
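This can be checked systematically: the Registry HTTP API v2 serves blobs at `/v2/<name>/blobs/<digest>`, so one HEAD request per referenced digest shows exactly which blobs are gone. A minimal sketch (registry and repository names are assumptions; `extract_digests` and `check_blobs` are hypothetical helpers, with `sed` used to avoid extra dependencies):

```shell
#!/usr/bin/env sh
# Hedged sketch: list every digest a manifest references, then check each one
# against the Docker Registry HTTP API v2 to see which blobs are missing.

# Pull "sha256:..." digest values out of manifest JSON on stdin.
extract_digests() {
  sed -n 's/.*"digest": *"\(sha256:[^"]*\)".*/\1/p'
}

# HEAD /v2/<name>/blobs/<digest> returns 200 if the blob exists, 404 if not.
check_blobs() {
  registry="registry.internal:5000"   # assumption for this environment
  repo="app"
  while read -r digest; do
    code=$(curl -s -o /dev/null -w '%{http_code}' -I \
        "http://$registry/v2/$repo/blobs/$digest")
    echo "$digest -> HTTP $code"
  done
}

# Usage:
#   docker manifest inspect registry.internal:5000/app:cache \
#     | extract_digests | check_blobs
```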

The Pivot

Why would a blob exist in the manifest but not in storage? Check the registry's storage backend:

# Harbor registry runs on Kubernetes
$ kubectl exec -it harbor-registry-0 -n harbor -- du -sh /storage/
47G     /storage/

$ kubectl exec -it harbor-registry-0 -n harbor -- ls /storage/docker/registry/v2/blobs/sha256/a8/a8f2b*
ls: cannot access '/storage/docker/registry/v2/blobs/sha256/a8/a8f2b*': No such file or directory

The blob sha256:a8f2b... is referenced by the cache manifest but does not exist on disk. Check the registry's filesystem:

$ kubectl exec -it harbor-registry-0 -n harbor -- df -h /storage
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        50G   47G  2.1G  96% /storage

The storage volume is 96% full. Check if garbage collection has been running:

$ kubectl logs -n harbor deploy/harbor-jobservice --since=168h | grep -iE "gc|garbage"
2026-03-16T02:00:00Z [INFO] Starting garbage collection
2026-03-16T02:00:45Z [ERROR] GC failed: filesystem is almost full, refusing to operate
2026-03-16T02:00:45Z [INFO] GC completed with errors
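The same history is visible through Harbor's REST API, which records each garbage-collection run and its outcome (the hostname and credentials below are assumptions for this environment):

```shell
# List recent garbage-collection runs and their statuses via the Harbor v2 API.
curl -s -u "admin:${HARBOR_ADMIN_PASSWORD}" \
  "https://harbor.internal/api/v2.0/system/gc?page_size=10" \
  | jq -r '.[] | "\(.creation_time)  \(.job_status)"'
```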

Phase 2: Linux Ops Investigation (Root Cause)

The registry's persistent volume is almost full. Garbage collection has been failing for at least 3 days. Without GC, orphaned layers accumulate. But the real problem is worse — the filesystem filled up during a push operation, causing a partial write:

$ kubectl logs -n harbor harbor-registry-0 --since=72h | grep -i "error\|write\|disk"
2026-03-17T14:22:31Z [ERROR] blob upload failed: write /storage/docker/registry/v2/blobs/sha256/a8/a8f2b.../data: no space left on device
2026-03-17T14:22:31Z [WARN] partial blob written, metadata committed but data incomplete

Three days ago, a cache push partially succeeded — the manifest was updated to reference sha256:a8f2b... but the actual blob write failed due to disk full. The manifest now points to a non-existent blob. Every build that tries to use this cache hits the missing blob and fails.
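The immediate mitigation is to remove the broken cache tag so builds fall back to an uncached (slow but successful) build. The registry API deletes manifests by digest, not by tag, so the digest has to be read from the `Docker-Content-Digest` response header first (a sketch; `manifest_digest` is a hypothetical helper, and deleting requires the registry to run with `delete.enabled: true`):

```shell
#!/usr/bin/env sh
# Hedged sketch: resolve a tag to its manifest digest, then delete it.

# Read the Docker-Content-Digest value from HTTP response headers on stdin.
manifest_digest() {
  awk 'tolower($1) == "docker-content-digest:" { print $2 }' | tr -d '\r'
}

# Usage:
#   digest=$(curl -s -I \
#       -H "Accept: application/vnd.oci.image.index.v1+json" \
#       "http://registry.internal:5000/v2/app/manifests/cache" | manifest_digest)
#   curl -s -X DELETE "http://registry.internal:5000/v2/app/manifests/$digest"
```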

Check what is consuming the space:

$ kubectl exec -it harbor-registry-0 -n harbor -- \
    find /storage -name "data" -size +100M | head -10 | xargs du -sh
12G     /storage/docker/registry/v2/blobs/sha256/3c/3c8d.../data
8.4G    /storage/docker/registry/v2/blobs/sha256/f1/f192.../data
5.2G    /storage/docker/registry/v2/blobs/sha256/b7/b7a4.../data

Old, untagged image layers are consuming gigabytes of space. Garbage collection should have removed them long ago.
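The registry component ships the `registry` binary from the distribution project, whose garbage collector supports a dry run, so the reclaimable space can be estimated before anything is deleted (the config file path is an assumption about this Harbor deployment):

```shell
# Report which blobs GC would delete, without actually deleting anything.
kubectl exec -n harbor harbor-registry-0 -- \
  registry garbage-collect --dry-run /etc/registry/config.yml
```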

Domain Bridge: Why This Crossed Domains

Key insight: the symptom was a CI pipeline failure (devops_tooling); the root cause was a corrupted cache blob caused by a disk-full condition on the registry's persistent volume (linux_ops); and the fix requires running registry garbage collection and resizing the volume (kubernetes_ops). This pattern is common because container image registries are storage-intensive services: without garbage collection they fill their volumes with orphaned layers, a full disk can corrupt cache manifests mid-push, and CI pipelines break as soon as their cache references become invalid.

Root Cause

The Harbor registry's 50GB persistent volume filled to 96%. Garbage collection had been failing for days due to the full filesystem. During a CI cache push, the disk ran out of space mid-write — the cache manifest was updated to reference a new blob, but the blob data was never fully written. All subsequent CI builds failed because the cache manifest pointed to a non-existent layer.
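A remediation sequence consistent with this root cause, sketched under assumptions (the PVC name, Harbor hostname, and credentials are placeholders; expanding a PVC requires a StorageClass with `allowVolumeExpansion: true`):

```shell
# 1. Expand the registry volume so GC has headroom to run at all.
kubectl patch pvc harbor-registry -n harbor \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

# 2. Trigger a manual Harbor garbage-collection run once space is available.
curl -s -u "admin:${HARBOR_ADMIN_PASSWORD}" -X POST \
  -H "Content-Type: application/json" \
  -d '{"schedule":{"type":"Manual"}}' \
  "https://harbor.internal/api/v2.0/system/gc/schedule"

# 3. After GC succeeds, delete the stale :cache tag and re-run the pipeline so
#    BuildKit repopulates the cache from a clean build.
```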