Remediation: CI Pipeline Fails, Docker Layer Cache Corruption, Fix Is Registry GC

Immediate Fix (Kubernetes Ops, Domain C)

The fix has three steps: expand the registry's PersistentVolumeClaim (PVC), run registry garbage collection to reclaim space, and delete the corrupted cache tag so that CI rebuilds it.

Step 1: Resize the persistent volume

# Check if the StorageClass allows expansion
$ kubectl get storageclass standard -o jsonpath='{.allowVolumeExpansion}'
true

# Resize the PVC
$ kubectl patch pvc harbor-registry-data -n harbor \
    -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
persistentvolumeclaim/harbor-registry-data patched

# Verify expansion
$ kubectl get pvc harbor-registry-data -n harbor
NAME                   STATUS   VOLUME     CAPACITY   ACCESS MODES
harbor-registry-data   Bound    pv-abc123  100Gi      RWO
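
The CAPACITY column only changes once the volume and its filesystem have actually been expanded, which can lag behind the patch. A minimal sketch of a polling helper follows; `wait_for_capacity` and `POLL_INTERVAL` are hypothetical names introduced here, and the PVC name and namespace are the ones used above.

```shell
# wait_for_capacity WANT TRIES CMD...
# Runs CMD up to TRIES times until its output equals WANT.
# POLL_INTERVAL (seconds, default 5) is a hypothetical knob for this sketch.
wait_for_capacity() {
  want=$1
  tries=$2
  shift 2
  got=""
  i=0
  while [ "$i" -lt "$tries" ]; do
    # Tolerate transient command failures and simply retry.
    got=$("$@" 2>/dev/null) || true
    if [ "$got" = "$want" ]; then
      echo "capacity is $got"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "timed out; last reported capacity: ${got:-unknown}" >&2
  return 1
}

# Example call against the PVC from Step 1 (commented out so the
# function can be sourced without cluster access):
# wait_for_capacity 100Gi 60 kubectl get pvc harbor-registry-data -n harbor \
#   -o jsonpath='{.status.capacity.storage}'
```

If the helper times out, `kubectl describe pvc harbor-registry-data -n harbor` shows the resize conditions and events for the claim.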

Step 2: Run garbage collection

# Trigger GC via Harbor API
$ curl -X POST "https://harbor.internal/api/v2.0/system/gc/schedule" \
    -H "Content-Type: application/json" \
    -H "Authorization: Basic $(echo -n admin:Harbor12345 | base64)" \
    -d '{"schedule":{"type":"Manual"},"parameters":{"delete_untagged":true}}'

# Monitor GC progress
$ kubectl logs -n harbor deploy/harbor-jobservice -f | grep -iE 'gc|garbage collection'
2026-03-19T16:00:00Z [INFO] Starting garbage collection
2026-03-19T16:02:15Z [INFO] GC completed: deleted 847 blobs, freed 28.3GB
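
To confirm that GC actually freed space, the completion line can be pulled out of the job log. A minimal sketch, assuming the log line has exactly the shape shown in the sample output above (real Harbor jobservice log formats may differ); `gc_summary` is a hypothetical helper name.

```shell
# gc_summary: read log lines on stdin and print the blob count and space
# freed for each "GC completed" line. The pattern mirrors the sample log
# line in this runbook and is an assumption about the real log format.
gc_summary() {
  sed -n 's/.*GC completed: deleted \([0-9][0-9]*\) blobs, freed \([0-9.][0-9.]*\)GB.*/blobs=\1 freed_gb=\2/p'
}

# Example (commented out; needs cluster access):
# kubectl logs -n harbor deploy/harbor-jobservice | gc_summary
```

An empty result means no completion line was found, which is worth investigating before moving on to Step 3.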

Step 3: Delete the corrupted cache tag and rebuild

# Delete the corrupted cache tag
$ curl -X DELETE "https://harbor.internal/v2/app/manifests/cache" \
    -H "Accept: application/vnd.oci.image.index.v1+json" \
    -H "Authorization: Basic $(echo -n admin:Harbor12345 | base64)"

# Trigger a CI build to rebuild the cache from scratch:
# either trigger the workflow manually with a no-cache parameter or push a no-op commit
$ git commit --allow-empty -m "ci: rebuild docker cache"
$ git push origin main

With the cache tag deleted, this build has no cache to import, so it rebuilds every layer as if run with --no-cache, then pushes a fresh cache for subsequent builds.
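
The workflow file itself is not shown here. If the build uses BuildKit's registry cache, the relevant invocation might look like the following sketch. The cache reference `harbor.internal/app:cache` is inferred from the manifest URL deleted above; `harbor.internal/app:latest`, `CACHE_REF`, and `buildx_args` are hypothetical.

```shell
# Assumed cache location, inferred from the DELETE URL in Step 3.
CACHE_REF="${CACHE_REF:-harbor.internal/app:cache}"

# buildx_args IMAGE: print the docker buildx arguments for a cached build.
# --cache-from imports layers from CACHE_REF when the tag exists; when it
# is missing, there is nothing to import and every layer is rebuilt.
# --cache-to re-exports the cache after the build.
buildx_args() {
  printf '%s\n' "buildx build --cache-from type=registry,ref=${CACHE_REF} --cache-to type=registry,ref=${CACHE_REF},mode=max --push -t $1 ."
}

# Example (commented out; needs Docker and registry access):
# docker $(buildx_args harbor.internal/app:latest)
```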

Verification

Domain A (DevOps Tooling): CI builds passing

# Check the latest CI run
$ gh run list --workflow=ci.yml --limit=3
STATUS  TITLE                           BRANCH  EVENT
✓       ci: rebuild docker cache        main    push
✗       fix: update payment timeout     main    push    # (old failure)
✗       feat: add retry logic           main    push    # (old failure)

Domain B (Linux Ops): Registry storage healthy

$ kubectl exec -it harbor-registry-0 -n harbor -- df -h /storage
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        100G   19G   78G  20% /storage
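
The same check can be scripted against the thresholds proposed under Prevention (70% warning, 85% critical). A minimal sketch; `storage_level` is a hypothetical helper that reads `df` output on stdin and classifies the first filesystem line.

```shell
# storage_level: read `df` output on stdin and print a level plus the
# usage percentage for the first filesystem (line 2 of the output).
storage_level() {
  awk 'NR == 2 {
    pct = $5
    sub(/%/, "", pct)   # strip the trailing percent sign
    pct += 0            # force numeric comparison
    if (pct > 85)      level = "CRITICAL"
    else if (pct > 70) level = "WARNING"
    else               level = "OK"
    print level, pct
  }'
}

# Example (commented out; needs cluster access). -P keeps each
# filesystem on a single line even when the device name is long:
# kubectl exec harbor-registry-0 -n harbor -- df -P /storage | storage_level
```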

Domain C (Kubernetes): PVC resized, GC scheduled

$ kubectl get pvc harbor-registry-data -n harbor
NAME                   STATUS   VOLUME     CAPACITY   ACCESS MODES
harbor-registry-data   Bound    pv-abc123  100Gi      RWO

# Verify GC is scheduled
$ curl -s "https://harbor.internal/api/v2.0/system/gc/schedule" \
    -H "Authorization: Basic $(echo -n admin:Harbor12345 | base64)" | jq .
{
  "schedule": {
    "type": "Daily",
    "cron": "0 2 * * *"
  }
}

Prevention

Monitoring: Add storage utilization alerts for all registry PVCs, firing WARNING at 70% and CRITICAL at 85%. The rule below covers the warning threshold.

- alert: RegistryStorageHigh
  expr: kubelet_volume_stats_used_bytes{namespace="harbor"} / kubelet_volume_stats_capacity_bytes > 0.7
  for: 30m
  labels:
    severity: warning
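
The rule above covers only the 70% warning threshold. A sketch of a complete rule file that also adds the 85% critical alert could look like this; the group name `registry-storage`, the alert name `RegistryStorageCritical`, the 10m duration, and the file path are assumptions, not taken from an existing configuration.

```shell
# write_registry_rules FILE: write a Prometheus rule file with both thresholds.
write_registry_rules() {
  cat > "$1" <<'EOF'
groups:
  - name: registry-storage
    rules:
      - alert: RegistryStorageHigh
        expr: kubelet_volume_stats_used_bytes{namespace="harbor"} / kubelet_volume_stats_capacity_bytes > 0.7
        for: 30m
        labels:
          severity: warning
      - alert: RegistryStorageCritical
        expr: kubelet_volume_stats_used_bytes{namespace="harbor"} / kubelet_volume_stats_capacity_bytes > 0.85
        for: 10m
        labels:
          severity: critical
EOF
}

# Example (commented out; promtool ships with Prometheus):
# write_registry_rules /tmp/registry-storage.rules.yml
# promtool check rules /tmp/registry-storage.rules.yml
```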

Runbook: Schedule registry garbage collection to run daily, and always run GC before any CI cache rebuild. Monitor GC job success: a failing GC job is a leading indicator of storage problems.

Architecture: Move the registry to an object storage backend (S3 or GCS) instead of a PVC to eliminate volume capacity limits. Set up image retention policies based on tag age and count.