Runbook: Velero Backup & Restore (Application-Level DR)¶
Symptoms¶
- Need to migrate workloads between clusters
- Accidental namespace deletion
- Need to restore specific applications, not the entire cluster
- Disaster recovery for application state (PVCs, configs, secrets)
Fast Triage¶
# Check Velero status
velero version
velero get backup-locations
velero backup get
# Check latest backup
velero backup describe <latest-backup-name> --details
What Velero Does¶
Velero backs up Kubernetes resources and persistent volumes:
[Namespaces] + [Deployments] + [Services] + [ConfigMaps] + [Secrets] + [PVCs]
                    |
                    v
      [Object Storage (S3/GCS)]  +  [Volume Snapshots]
Installation¶
# Install Velero CLI
curl -LO https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-amd64.tar.gz
tar xvf velero-v1.13.0-linux-amd64.tar.gz
sudo mv velero-v1.13.0-linux-amd64/velero /usr/local/bin/
# Install Velero server (AWS example)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero
# Verify
kubectl get pods -n velero
velero get backup-locations
Backup Procedures¶
Full Cluster Backup¶
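A full-cluster backup needs no resource filters; everything Velero can see, including cluster-scoped resources, is captured. The dated name below is a convention, not a requirement:

```shell
# Back up all namespaces and cluster-scoped resources
velero backup create full-$(date +%Y%m%d) --wait

# Inspect what was captured
velero backup describe full-$(date +%Y%m%d) --details
```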
Namespace Backup¶
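Scoping the backup to a single namespace keeps it small and fast to restore; `grokdevops` is the example namespace used throughout this runbook:

```shell
# Back up only the grokdevops namespace (its PVCs, ConfigMaps, and Secrets included)
velero backup create grokdevops-$(date +%Y%m%d) \
  --include-namespaces grokdevops \
  --wait
```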
Scheduled Backups¶
# Daily backup of grokdevops namespace, retain 7 days (168h = 7 * 24h)
velero schedule create grokdevops-daily \
  --schedule="0 2 * * *" \
  --include-namespaces grokdevops \
  --ttl 168h

# Weekly full backup, retain 30 days (720h = 30 * 24h)
velero schedule create weekly-full \
  --schedule="0 3 * * 0" \
  --ttl 720h
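To verify a new schedule without waiting for its cron window, a one-off backup can be triggered from its template with the standard `--from-schedule` flag:

```shell
# Run the schedule's backup template immediately
velero backup create --from-schedule grokdevops-daily

# Backups created from a schedule are named <schedule>-<timestamp>; list them
velero backup get | grep grokdevops-daily
```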
Backup with Volume Snapshots¶
velero backup create with-volumes-$(date +%Y%m%d) \
  --include-namespaces grokdevops \
  --snapshot-volumes=true \
  --wait
Restore Procedures¶
Restore a Namespace¶
# Restore grokdevops namespace from backup
velero restore create --from-backup grokdevops-backup-20240115 \
  --wait
# Check restore status
velero restore describe <restore-name> --details
Restore to a Different Namespace¶
velero restore create --from-backup grokdevops-backup-20240115 \
  --namespace-mappings grokdevops:grokdevops-restored \
  --wait
Restore Specific Resources¶
# Restore only deployments and services
velero restore create --from-backup grokdevops-backup-20240115 \
  --include-resources deployments,services \
  --wait

# Restore only deployments matching a label selector
velero restore create --from-backup grokdevops-backup-20240115 \
  --include-resources deployments \
  --selector app=grokdevops \
  --wait
Migrate to Another Cluster¶
[!WARNING] Cross-cluster restores are destructive. Restoring into a namespace that already contains resources will overwrite them (with
--existing-resource-policy=update) or skip them on conflict with only a warning in the restore logs. Always restore to a new namespace or an empty cluster first, verify the result, then cut over.
# Source cluster: create backup
velero backup create migration-$(date +%Y%m%d) \
  --include-namespaces grokdevops \
  --wait

# Target cluster: install Velero pointing to the same bucket
velero install --provider aws --bucket velero-backups ...

# Target cluster: restore, using the exact backup name from `velero backup get`
# ($(date +%Y%m%d) only matches if the restore runs on the same day as the backup)
velero restore create --from-backup migration-$(date +%Y%m%d) \
  --wait
Verification¶
# Check restore status
velero restore describe <restore-name>
# Expected: Phase: Completed
# Verify resources
kubectl get all -n grokdevops
kubectl get pvc -n grokdevops
# Verify application health (curl from inside the cluster, or via kubectl port-forward)
kubectl rollout status deployment/grokdevops -n grokdevops
curl -s http://grokdevops:8000/health
Monitoring¶
# List all backups with status
velero backup get
# Check for failed backups
velero backup get -o json | jq '.items[] | select(.status.phase != "Completed") | .metadata.name'
# Check schedule status
velero schedule get
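The failed-backup check above can be wrapped in a small script that exits non-zero when anything is amiss, suitable for a cron job or alerting hook. A minimal sketch:

```shell
#!/bin/sh
# Exit non-zero if any Velero backup is not in phase Completed (for cron/alerting)
failed=$(velero backup get -o json \
  | jq -r '.items[] | select(.status.phase != "Completed") | .metadata.name')
if [ -n "$failed" ]; then
  echo "Failed backups: $failed" >&2
  exit 1
fi
```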
Common Issues¶
| Issue | Fix |
|---|---|
| Backup stuck in "InProgress" | Check Velero pod logs, verify S3 connectivity |
| Restore fails with "already exists" | Use --existing-resource-policy=update |
| PVC not restored | Verify snapshot-location is configured, CSI driver supports snapshots |
| Partial restore | Check velero restore logs <name> for skipped resources |
| Credentials expired | Update the Velero secret with new AWS/GCP credentials |
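For the expired-credentials row, the secret name `cloud-credentials` and key `cloud` below are the defaults created by `velero install --secret-file`; adjust if your installation used different names:

```shell
# Replace the Velero cloud credentials (default secret/key names assumed)
kubectl create secret generic cloud-credentials \
  --namespace velero \
  --from-file=cloud=./credentials-velero \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart Velero so it picks up the new secret
kubectl rollout restart deployment/velero -n velero
```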
Wiki Navigation¶
Related Content¶
- Adversarial Interview Gauntlet (30 sequences) (Scenario, L2) — Kubernetes Core
- Case Study: Alert Storm — Flapping Health Checks (Case Study, L2) — Kubernetes Core
- Case Study: Canary Deploy Routing to Wrong Backend — Ingress Misconfigured (Case Study, L2) — Kubernetes Core
- Case Study: CrashLoopBackOff No Logs (Case Study, L1) — Kubernetes Core
- Case Study: DNS Looks Broken — TLS Expired, Fix Is Cert-Manager (Case Study, L2) — Kubernetes Core
- Case Study: DaemonSet Blocks Eviction (Case Study, L2) — Kubernetes Core
- Case Study: Deployment Stuck — ImagePull Auth Failure, Vault Secret Rotation (Case Study, L2) — Kubernetes Core
- Case Study: Drain Blocked by PDB (Case Study, L2) — Kubernetes Core
- Case Study: HPA Flapping — Metrics Server Clock Skew, Fix Is NTP (Case Study, L2) — Kubernetes Core
- Case Study: ImagePullBackOff Registry Auth (Case Study, L1) — Kubernetes Core