What is Kubernetes disaster recovery and backup?
Why Interviewers Ask This
Senior Kubernetes (K8s) engineers are expected to reason about architecture, performance, and edge cases. This question separates mid-level from senior candidates by testing deep system-level understanding.
Answer
Kubernetes disaster recovery spans several layers.

etcd backup (cluster state): the most critical layer, because all resource definitions are stored in etcd. Backup strategy: run a scheduled CronJob that executes etcdctl snapshot save, ship the snapshots to S3 with encryption, and test restores regularly. On managed Kubernetes (EKS/GKE/AKS) the provider manages etcd, but you still need to back up application data.

Velero (application backup): an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes. Features: scheduled and on-demand backups, selective backup (by namespace or label selector), cross-cluster restore, migration to a new cluster, and disaster recovery.

Installation:

    velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.7.0 \
      --bucket my-velero-bucket --secret-file ./credentials-velero \
      --backup-location-config region=us-east-1 \
      --snapshot-location-config region=us-east-1

Usage:

    velero backup create daily-backup --include-namespaces production --ttl 720h
    velero backup get
    velero restore create --from-backup daily-backup
    velero schedule create daily --schedule "0 1 * * *" --include-namespaces production

PV backup: Velero uses volume snapshots (cloud provider snapshots, for example through the CSI VolumeSnapshot API). For cross-region recovery, ensure snapshots are copied to the target region.

RTO/RPO considerations: an etcd restore requires cluster downtime, and a Velero restore requires a cluster to already exist. Multi-region active-active aims for no downtime, while backup-and-restore takes minutes to hours.

Infrastructure as Code: if the infrastructure is defined in Terraform/CDK and the manifests live in Git, the entire cluster can be recreated from code. Combine the approaches: IaC for cluster creation, Velero for application state, and etcd snapshots for cluster resources.

Runbooks: document step-by-step recovery procedures and test them regularly with chaos engineering tools (Chaos Mesh, LitmusChaos).
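The etcd backup strategy described in the answer (take a snapshot, then ship it to S3 with encryption) could be sketched roughly as the script below, which a CronJob or host cron entry might run. This is a minimal sketch under stated assumptions, not a production script: the endpoint, the certificate paths (typical of a kubeadm layout), the backup directory, the bucket name my-etcd-backups, and the DRY_RUN switch are all hypothetical and not taken from the original answer. It defaults to dry-run mode, so it only prints the commands it would execute.

```shell
#!/bin/sh
# Sketch: take an etcd snapshot and upload it to S3 with server-side encryption.
# Every default below is an assumption for illustration; adjust to your cluster.
set -eu

ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
ETCD_CACERT="${ETCD_CACERT:-/etc/kubernetes/pki/etcd/ca.crt}"
ETCD_CERT="${ETCD_CERT:-/etc/kubernetes/pki/etcd/server.crt}"
ETCD_KEY="${ETCD_KEY:-/etc/kubernetes/pki/etcd/server.key}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
S3_BUCKET="${S3_BUCKET:-my-etcd-backups}"   # hypothetical bucket name
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = execute them

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# snapshot_name: build a snapshot file name from a timestamp.
snapshot_name() {
  printf 'etcd-%s.db' "$1"
}

# backup_etcd: snapshot etcd, sanity-check the file, upload it encrypted.
backup_etcd() {
  snap_file="$BACKUP_DIR/$(snapshot_name "$1")"

  # Save a point-in-time snapshot of the etcd keyspace.
  run etcdctl --endpoints="$ETCD_ENDPOINT" \
    --cacert="$ETCD_CACERT" --cert="$ETCD_CERT" --key="$ETCD_KEY" \
    snapshot save "$snap_file"

  # Cheap sanity check only: the file exists and is non-empty.
  # It does not replace a real restore test.
  run test -s "$snap_file"

  # Upload with server-side encryption using the account's KMS key.
  run aws s3 cp "$snap_file" "s3://$S3_BUCKET/$(snapshot_name "$1")" --sse aws:kms
}

backup_etcd "$(date -u +%Y%m%dT%H%M%SZ)"
```

Running it with DRY_RUN=0 on a control-plane node would execute the commands instead of printing them, assuming etcdctl and the AWS CLI are installed and have credentials. The restore side (etcdctl or etcdutl snapshot restore, depending on the etcd version) is deliberately left out, since it involves stopping the control plane and belongs in the runbook.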
Pro Tip
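Call out the Kubernetes-specific nuances in your answer rather than describing backups in general terms: a managed service (EKS/GKE/AKS) takes care of etcd but not your application data, an etcd restore means cluster downtime, and a Velero restore needs a cluster to already exist. Naming these trade-offs shows expertise rather than generic knowledge.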