Runbook: Velero Backup & Restore (Application-Level DR)¶
Symptoms¶
- Need to migrate workloads between clusters
- Accidental namespace deletion
- Need to restore specific applications, not the entire cluster
- Disaster recovery for application state (PVCs, configs, secrets)
Fast Triage¶
# Check Velero status
velero version
velero get backup-locations
velero backup get
# Check latest backup
velero backup describe <latest-backup-name> --details
What Velero Does¶
Velero backs up Kubernetes resources and persistent volumes:
[Namespaces] + [Deployments] + [Services] + [ConfigMaps] + [Secrets] + [PVCs]
                    |
                    v
      [Object Storage (S3/GCS)]  +  [Volume Snapshots]
Installation¶
# Install Velero CLI
curl -LO https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
tar xvf velero-v1.13.0-linux-amd64.tar.gz
sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/
# Install Velero server (AWS example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero
# Verify
kubectl get pods -n velero
velero get backup-locations
Backup Procedures¶
Full Cluster Backup¶
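A full-cluster backup needs no resource filters; everything Velero can see, including cluster-scoped resources, is captured. The dated name below is a convention, not a requirement:

```shell
# Back up all namespaces and cluster-scoped resources
velero backup create full-$(date +%Y%m%d) --wait

# Inspect what was captured
velero backup describe full-$(date +%Y%m%d) --details
```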
Namespace Backup¶
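Scoping the backup to a single namespace keeps it small and fast to restore; `grokdevops` is the example namespace used throughout this runbook:

```shell
# Back up only the grokdevops namespace (its PVCs, ConfigMaps, and Secrets included)
velero backup create grokdevops-$(date +%Y%m%d) \
  --include-namespaces grokdevops \
  --wait
```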
Scheduled Backups¶
# Daily backup of grokdevops namespace, retain 7 days (168h = 7 * 24h)
velero schedule create grokdevops-daily \
  --schedule="0 2 * * *" \
  --include-namespaces grokdevops \
  --ttl 168h

# Weekly full backup, retain 30 days (720h = 30 * 24h)
velero schedule create weekly-full \
  --schedule="0 3 * * 0" \
  --ttl 720h
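To verify a new schedule without waiting for its cron window, a one-off backup can be triggered from its template with the standard `--from-schedule` flag:

```shell
# Run the schedule's backup template immediately
velero backup create --from-schedule grokdevops-daily

# Backups created from a schedule are named <schedule>-<timestamp>; list them
velero backup get | grep grokdevops-daily
```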
Backup with Volume Snapshots¶
velero backup create with-volumes-$(date +%Y%m%d) \
  --include-namespaces grokdevops \
  --snapshot-volumes=true \
  --wait
Restore Procedures¶
Restore a Namespace¶
# Restore grokdevops namespace from backup
velero restore create --from-backup grokdevops-backup-20240115 \
  --wait
# Check restore status
velero restore describe <restore-name> --details
Restore to a Different Namespace¶
velero restore create --from-backup grokdevops-backup-20240115 \
  --namespace-mappings grokdevops:grokdevops-restored \
  --wait
Restore Specific Resources¶
# Restore only deployments and services
velero restore create --from-backup grokdevops-backup-20240115 \
  --include-resources deployments,services \
  --wait

# Restore only deployments matching a label selector
velero restore create --from-backup grokdevops-backup-20240115 \
  --include-resources deployments \
  --selector app=grokdevops \
  --wait
Migrate to Another Cluster¶
[!WARNING] Cross-cluster restores are destructive. Restoring into a namespace that already contains resources will overwrite them (with
--existing-resource-policy=update) or skip them on conflict with only a warning in the restore logs. Always restore to a new namespace or an empty cluster first, verify the result, then cut over.
# Source cluster: create backup
velero backup create migration-$(date +%Y%m%d) \
  --include-namespaces grokdevops \
  --wait

# Target cluster: install Velero pointing to the same bucket
velero install --provider aws --bucket velero-backups ...

# Target cluster: restore, using the exact backup name from `velero backup get`
# ($(date +%Y%m%d) only matches if the restore runs on the same day as the backup)
velero restore create --from-backup migration-$(date +%Y%m%d) \
  --wait
Verification¶
# Check restore status
velero restore describe <restore-name>
# Expected: Phase: Completed
# Verify resources
kubectl get all -n grokdevops
kubectl get pvc -n grokdevops
# Verify application health (curl from inside the cluster, or via kubectl port-forward)
kubectl rollout status deployment/grokdevops -n grokdevops
curl -s http://grokdevops:8000/health
Monitoring¶
# List all backups with status
velero backup get
# Check for failed backups
velero backup get -o json | jq '.items[] | select(.status.phase != "Completed") | .metadata.name'
# Check schedule status
velero schedule get
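The failed-backup check above can be wrapped in a small script that exits non-zero when anything is amiss, suitable for a cron job or alerting hook. A minimal sketch:

```shell
#!/bin/sh
# Exit non-zero if any Velero backup is not in phase Completed (for cron/alerting)
failed=$(velero backup get -o json \
  | jq -r '.items[] | select(.status.phase != "Completed") | .metadata.name')
if [ -n "$failed" ]; then
  echo "Failed backups: $failed" >&2
  exit 1
fi
```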
Common Issues¶
| Issue | Fix |
|---|---|
| Backup stuck in "InProgress" | Check Velero pod logs, verify S3 connectivity |
| Restore fails with "already exists" | Use --existing-resource-policy=update |
| PVC not restored | Verify snapshot-location is configured, CSI driver supports snapshots |
| Partial restore | Check velero restore logs <name> for skipped resources |
| Credentials expired | Update the Velero secret with new AWS/GCP credentials |
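For the expired-credentials row, the secret name `cloud-credentials` and key `cloud` below are the defaults created by `velero install --secret-file`; adjust if your installation used different names:

```shell
# Replace the Velero cloud credentials (default secret/key names assumed)
kubectl create secret generic cloud-credentials \
  --namespace velero \
  --from-file=cloud=./credentials-velero \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Velero so it picks up the new secret
kubectl rollout restart deployment/velero -n velero
```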
Wiki Navigation¶
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core