What is Kubernetes disaster recovery and backup?

Why Interviewers Ask This

Senior Kubernetes (K8s) engineers are expected to reason about architecture, performance, and edge cases. This question separates mid-level from senior candidates by testing deep system-level understanding.

Answer

Kubernetes disaster recovery spans several layers, each covering a different failure mode.

etcd backup (cluster state): the most critical layer, because every resource definition is stored in etcd. A typical strategy is a scheduled CronJob that runs etcdctl snapshot save and ships the snapshots to S3 with encryption, combined with regular restore tests. On managed K8s (EKS/GKE/AKS) the provider manages etcd, but you still need to back up application data yourself.

Velero (application backup): an open-source tool for backing up and restoring Kubernetes cluster resources and persistent volumes. Features: scheduled and on-demand backups, selective backup (by namespace or label selector), cross-cluster restore, migration to a new cluster, and disaster recovery.

Installation:

    velero install --provider aws --plugins velero/velero-plugin-for-aws:v1.7.0 \
      --bucket my-velero-bucket --secret-file ./credentials-velero \
      --backup-location-config region=us-east-1 \
      --snapshot-location-config region=us-east-1

Usage:

    velero backup create daily-backup --include-namespaces production --ttl 720h
    velero backup get
    velero restore create --from-backup daily-backup
    velero schedule create daily --schedule "0 1 * * *" --include-namespaces production

PV backup: Velero uses volume snapshots (cloud provider snapshots, taken through the provider plugin or the CSI VolumeSnapshot API). Snapshots are usually bound to the region where they were taken, so for cross-region recovery make sure they are copied to the target region.

RTO/RPO considerations: an etcd restore requires cluster downtime, and a Velero restore requires a target cluster to exist already. Multi-region active-active aims for no downtime, whereas backup-and-restore typically takes minutes to hours. RPO is bounded by the backup interval: with the daily schedule above, up to 24 hours of changes can be lost.

Infrastructure as Code: if the infrastructure is defined in Terraform/CDK and the manifests live in Git, the entire cluster can be recreated from code. Combine the layers: IaC for cluster creation, Velero for application state, and etcd snapshots for cluster resources.

Runbooks: document step-by-step recovery procedures and test them regularly, for example with chaos engineering tools (Chaos Mesh, LitmusChaos).
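
The etcd snapshot strategy described in the answer (snapshot, then ship to S3 with encryption) could be sketched as the script below. This is a minimal illustration, not a production job: the endpoint and certificate paths assume a kubeadm-style control plane, the bucket name my-etcd-backups is hypothetical, and the DRY_RUN switch and the run helper are additions for safe experimentation. By default the script only prints the commands it would execute.

```shell
#!/usr/bin/env bash
# Sketch of an etcd snapshot job. Endpoint, certificate paths, and bucket
# name are assumptions (kubeadm-style defaults and a made-up bucket).
set -euo pipefail

export ETCDCTL_API=3
ETCD_ENDPOINT="${ETCD_ENDPOINT:-https://127.0.0.1:2379}"
PKI_DIR="${PKI_DIR:-/etc/kubernetes/pki/etcd}"
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
S3_BUCKET="${S3_BUCKET:-my-etcd-backups}"   # hypothetical bucket
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print commands only

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build a UTC-timestamped snapshot path inside BACKUP_DIR.
snapshot_path() {
  echo "${BACKUP_DIR}/etcd-$(date -u +%Y%m%dT%H%M%SZ).db"
}

backup_etcd() {
  local snap
  snap="$(snapshot_path)"
  run mkdir -p "$BACKUP_DIR"
  # Take the snapshot from one etcd member over TLS.
  run etcdctl --endpoints="$ETCD_ENDPOINT" \
    --cacert="$PKI_DIR/ca.crt" \
    --cert="$PKI_DIR/server.crt" \
    --key="$PKI_DIR/server.key" \
    snapshot save "$snap"
  # Upload with server-side encryption enabled.
  run aws s3 cp "$snap" "s3://${S3_BUCKET}/etcd/$(basename "$snap")" --sse AES256
}

backup_etcd
```

Running it with DRY_RUN=0 on a control-plane node (with etcdctl and the AWS CLI installed) would perform the actual snapshot and upload. Restore testing still needs its own procedure, since a snapshot that has never been restored is unproven.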

Pro Tip

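Call out the Kubernetes (K8s)-specific nuances rather than giving generic backup advice: etcd snapshots capture cluster state but not persistent volume data, managed providers handle etcd but not your application data, and a backup only counts once a restore from it has been tested. Highlighting those nuances in your answer shows expertise rather than generic knowledge.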