Remediation: CI Pipeline Fails, Docker Layer Cache Corruption, Fix Is Registry GC
Immediate Fix (Kubernetes Ops, Domain C)
The fix has three steps: resize the registry's PersistentVolumeClaim (PVC), run registry garbage collection, and rebuild the Docker layer cache.
Step 1: Resize the persistent volume
# Check if the StorageClass allows expansion
$ kubectl get storageclass standard -o jsonpath='{.allowVolumeExpansion}'
true
# Resize the PVC
$ kubectl patch pvc harbor-registry-data -n harbor \
-p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
persistentvolumeclaim/harbor-registry-data patched
# Verify expansion
$ kubectl get pvc harbor-registry-data -n harbor
NAME STATUS VOLUME CAPACITY ACCESS MODES
harbor-registry-data Bound pv-abc123 100Gi RWO
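The patch only changes the requested size. The reported capacity can lag behind it, and some storage drivers finish the file system expansion only after the pod that mounts the volume restarts. A minimal sketch of a polling check, assuming the same PVC name and namespace as above; the function name `wait_for_pvc_capacity` and its default retry values are hypothetical:

```shell
# Poll the PVC until status.capacity.storage matches the requested size.
# Usage: wait_for_pvc_capacity <pvc> <namespace> <size> [attempts] [delay_seconds]
# The helper name and the defaults (30 checks, 10 s apart) are illustrative.
wait_for_pvc_capacity() {
  pvc=$1; ns=$2; want=$3; attempts=${4:-30}; delay=${5:-10}
  i=0
  got=
  while [ "$i" -lt "$attempts" ]; do
    got=$(kubectl get pvc "$pvc" -n "$ns" -o jsonpath='{.status.capacity.storage}')
    if [ "$got" = "$want" ]; then
      echo "PVC $pvc reports $got"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "PVC $pvc still reports ${got:-unknown} after $attempts checks" >&2
  return 1
}
```

For example, `wait_for_pvc_capacity harbor-registry-data harbor 100Gi` returns non-zero if the capacity never reaches 100Gi, which is a cue to inspect the PVC's conditions with `kubectl describe pvc`.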
Step 2: Run garbage collection
# Trigger GC via Harbor API
$ curl -X POST "https://harbor.internal/api/v2.0/system/gc/schedule" \
-H "Content-Type: application/json" \
-H "Authorization: Basic $(echo -n admin:Harbor12345 | base64)" \
-d '{"schedule":{"type":"Manual"},"parameters":{"delete_untagged":true}}'
# Monitor GC progress
$ kubectl logs -n harbor deploy/harbor-jobservice -f | grep -iE 'gc|garbage collection'
2026-03-19T16:00:00Z [INFO] Starting garbage collection
2026-03-19T16:02:15Z [INFO] GC completed: deleted 847 blobs, freed 28.3GB
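The API calls in this runbook embed the admin credentials directly in the command, which leaves them in shell history. One possible alternative is to read them from the environment. This is a sketch only: `HARBOR_USER`, `HARBOR_PASSWORD`, and the two helper functions are hypothetical names, not part of Harbor.

```shell
# Build an HTTP Basic Authorization header from environment variables
# instead of typing the password into the command line.
# HARBOR_USER and HARBOR_PASSWORD are hypothetical variable names.
harbor_auth_header() {
  : "${HARBOR_USER:?set HARBOR_USER}" "${HARBOR_PASSWORD:?set HARBOR_PASSWORD}"
  printf 'Authorization: Basic %s' \
    "$(printf '%s:%s' "$HARBOR_USER" "$HARBOR_PASSWORD" | base64 | tr -d '\n')"
}

# Thin curl wrapper for Harbor API calls against the host used in this runbook.
# Usage: harbor_api <METHOD> <path> [extra curl args...]
harbor_api() {
  method=$1; path=$2; shift 2
  curl -sS -X "$method" "https://harbor.internal/api/v2.0${path}" \
    -H "$(harbor_auth_header)" "$@"
}
```

With these defined, the GC trigger above could be written as `harbor_api POST /system/gc/schedule -H "Content-Type: application/json" -d '{"schedule":{"type":"Manual"},"parameters":{"delete_untagged":true}}'`.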
Step 3: Delete the corrupted cache tag and rebuild
# Delete the corrupted cache tag
$ curl -X DELETE "https://harbor.internal/v2/app/manifests/cache" \
-H "Accept: application/vnd.oci.image.index.v1+json" \
-H "Authorization: Basic $(echo -n admin:Harbor12345 | base64)"
# Trigger a CI build to rebuild the cache from scratch:
# either run the workflow manually or push an empty commit
$ git commit --allow-empty -m "ci: rebuild docker cache"
$ git push origin main
With the cache tag deleted, this build finds no cache to import, so it rebuilds every layer as if it had been run with --no-cache. It then pushes a fresh cache for subsequent builds.
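The build step itself is not shown in this runbook. As an illustration of how a registry-backed cache is imported and re-exported, here is a sketch that assumes the CI job uses `docker buildx` with the `registry` cache type. The image reference, the function name, and the use of buildx at all are assumptions; only the `app` repository and `cache` tag come from the step above.

```shell
# Build and push the image while importing and exporting a registry-backed cache.
# IMAGE and CACHE_REF are assumed values derived from the "app" repository
# and "cache" tag used earlier; adjust them to the real CI configuration.
IMAGE=${IMAGE:-harbor.internal/app:latest}
CACHE_REF=${CACHE_REF:-harbor.internal/app:cache}

build_with_registry_cache() {
  # $1 is the build context directory (defaults to the current directory).
  docker buildx build \
    --cache-from "type=registry,ref=${CACHE_REF}" \
    --cache-to "type=registry,ref=${CACHE_REF},mode=max" \
    --tag "$IMAGE" \
    --push \
    "${1:-.}"
}
```

When the cache reference does not exist, the import is typically skipped with a warning and every layer is rebuilt, after which `--cache-to` writes a new cache manifest under the same tag.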
Verification
Domain A (DevOps Tooling): CI builds passing
# Check the latest CI run
$ gh run list --workflow=ci.yml --limit=3
STATUS TITLE BRANCH EVENT
✓ ci: rebuild docker cache main push
✗ fix: update payment timeout main push # (old failure)
✗ feat: add retry logic main push # (old failure)
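For a scripted version of this check, the same `gh` command can emit JSON. A minimal sketch, assuming the workflow file and branch shown above; the function name `latest_ci_passed` is hypothetical:

```shell
# Return success only if the most recent run of the workflow on the branch
# concluded with "success". A run that is still in progress has an empty
# conclusion and is treated as not passed.
latest_ci_passed() {
  workflow=${1:-ci.yml}; branch=${2:-main}
  conclusion=$(gh run list --workflow="$workflow" --branch="$branch" --limit=1 \
    --json conclusion --jq '.[0].conclusion')
  [ "$conclusion" = "success" ]
}
```

A caller could use it as `latest_ci_passed ci.yml main && echo "CI is green"`.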
Domain B (Linux Ops): Registry storage healthy
$ kubectl exec -it harbor-registry-0 -n harbor -- df -h /storage
Filesystem Size Used Avail Use% Mounted on
/dev/sdc 100G 19G 78G 20% /storage
Domain C (Kubernetes): PVC resized, GC scheduled
$ kubectl get pvc harbor-registry-data -n harbor
NAME STATUS VOLUME CAPACITY ACCESS MODES
harbor-registry-data Bound pv-abc123 100Gi RWO
# Verify that a recurring GC schedule is configured (the Step 2 run was a one-off manual trigger)
$ curl -s "https://harbor.internal/api/v2.0/system/gc/schedule" \
-H "Authorization: Basic $(echo -n admin:Harbor12345 | base64)" | jq .
{
"schedule": {
"type": "Daily",
"cron": "0 2 * * *"
}
}
Prevention
- Monitoring: Add storage utilization alerts for all registry PVCs. Fire WARNING at 70%, CRITICAL at 85%. The rule below covers the WARNING threshold; a CRITICAL rule would use > 0.85 with severity: critical.
  - alert: RegistryStorageHigh
    expr: kubelet_volume_stats_used_bytes{namespace="harbor"} / kubelet_volume_stats_capacity_bytes > 0.7
    for: 30m
    labels:
      severity: warning
- Runbook: Schedule registry garbage collection to run daily. Always run GC before any CI cache rebuild. Monitor GC job success; a failing GC is a leading indicator of storage problems.
- Architecture: Move the registry to an object storage backend (S3/GCS) instead of a PVC to eliminate volume capacity limits. Set up image retention policies by tag age and count.
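Until the alert rule is deployed, the same thresholds can be checked by hand against the `df` output from the verification step. This is a rough sketch rather than a replacement for the alert; both function names are hypothetical, and it treats exactly 70% and 85% as already over the threshold.

```shell
# Extract the Use% value (without the % sign) for a mount point
# from `df -P` style output read on stdin.
usage_for_mount() {
  awk -v m="$1" 'NR > 1 && $6 == m { sub(/%/, "", $5); print $5 }'
}

# Classify a usage percentage against the 70% / 85% thresholds
# from the monitoring policy. Accepts "20" or "20%".
classify_usage() {
  pct=${1%\%}
  if [ "$pct" -ge 85 ]; then
    echo CRITICAL
  elif [ "$pct" -ge 70 ]; then
    echo WARNING
  else
    echo OK
  fi
}
```

For example, `kubectl exec harbor-registry-0 -n harbor -- df -P /storage | usage_for_mount /storage` yields the percentage, which can then be passed to `classify_usage`.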