Disaster Recovery

Procedure for snapshotting the etcd database and recovering from a catastrophic control plane failure.

The etcd database backs Kubernetes control plane state, so if the etcd service is unavailable, the Kubernetes control plane goes down, and the cluster is not recoverable until etcd is recovered. etcd is built around the Raft consensus protocol, so highly available control plane clusters can tolerate the loss of nodes as long as more than half of the members are running and reachable. For a Talos cluster with three control plane nodes, this means that the cluster tolerates the failure of any single node, but losing more than one node at the same time leads to complete loss of service. For this reason, it is important to take routine backups of etcd state, so that a snapshot is available to recover the cluster from in case of catastrophic failure.
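The quorum rule above can be worked through with a small sketch. The function names quorum and tolerated_failures are hypothetical helpers for illustration only; they are not part of talosctl or etcd.

```shell
# Quorum for an etcd cluster of N members is floor(N / 2) + 1,
# i.e. "more than half of the members".
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Members that may fail while the remaining ones still form a quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

for members in 1 3 5; do
  echo "members=$members quorum=$(quorum "$members") tolerated_failures=$(tolerated_failures "$members")"
done
```

For three members this reports a quorum of 2 and one tolerated failure, which matches the three control plane node case described above. A single-member cluster tolerates no failures at all.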

Backup

Snapshotting etcd Database

Create a consistent snapshot of the etcd database with the talosctl etcd snapshot command:

$ talosctl -n <IP> etcd snapshot db.snapshot
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136

Note: filename db.snapshot is arbitrary.

This database snapshot can be taken on any healthy control plane node (with IP address <IP> in the example above), as all etcd instances contain exactly the same data. It is recommended to create etcd snapshots on a regular schedule, so that recovery can use a recent snapshot.
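A scheduled snapshot could be wrapped in a small helper like the one below. This is a minimal sketch, not an official procedure: the function name snapshot_etcd, the TALOSCTL override variable, and the timestamped file naming are assumptions made for illustration. Only the talosctl etcd snapshot invocation itself comes from this guide.

```shell
# TALOSCTL is a hypothetical override hook; it defaults to talosctl on PATH.
TALOSCTL="${TALOSCTL:-talosctl}"

# Take a timestamped etcd snapshot from one healthy control plane node.
#   $1: IP of the control plane node
#   $2: local directory that collects snapshots
# Prints the path of the new snapshot file on success.
snapshot_etcd() {
  node_ip="$1"
  backup_dir="$2"
  mkdir -p "$backup_dir" || return 1
  target="$backup_dir/etcd-$(date -u +%Y%m%dT%H%M%SZ).snapshot"
  # Send talosctl's own messages to stderr so stdout carries only the path.
  "$TALOSCTL" -n "$node_ip" etcd snapshot "$target" >&2 || return 1
  echo "$target"
}
```

A scheduler such as cron or a systemd timer could then call snapshot_etcd with a node IP and a backup directory at whatever interval matches the acceptable amount of lost state.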

Disaster Database Snapshot

If the etcd cluster is not healthy (for example, if quorum has already been lost), the talosctl etcd snapshot command might fail. In that case, copy the database snapshot directly from the control plane node:

talosctl -n <IP> cp /var/lib/etcd/member/snap/db .

This snapshot might not be fully consistent (if the etcd process is running), but it allows for disaster recovery when the latest regular snapshot is not available.
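The two approaches can be combined into a fallback: try the consistent snapshot first, and copy the raw database file only if that fails. The sketch below assumes a hypothetical helper named snapshot_or_copy and hypothetical result labels ("consistent" and "raw-copy"); the two talosctl commands are the ones shown in this guide.

```shell
# TALOSCTL is a hypothetical override hook; it defaults to talosctl on PATH.
TALOSCTL="${TALOSCTL:-talosctl}"

# Try a consistent snapshot; fall back to copying the raw database file.
#   $1: IP of the control plane node
# Files are written to the current directory (db.snapshot or db).
# Prints "consistent" or "raw-copy" so the caller knows which kind it got.
snapshot_or_copy() {
  node_ip="$1"
  if "$TALOSCTL" -n "$node_ip" etcd snapshot db.snapshot >&2; then
    echo "consistent"
  else
    "$TALOSCTL" -n "$node_ip" cp /var/lib/etcd/member/snap/db . >&2 || return 1
    echo "raw-copy"
  fi
}
```

Recording which kind of file was obtained is useful because a raw copy may not pass the integrity check that a regular snapshot passes on restore.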

Machine Configuration

Machine configuration might be required to recover the node after a hardware failure. Back up the Talos node machine configuration with the command:

talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -
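To keep one backup file per node, the command above could be wrapped as follows. This is a sketch under assumptions: the function name backup_machine_config, the TALOSCTL and YQ override variables, and the <node IP>.yaml file naming are hypothetical. The pipeline itself is the one shown above.

```shell
# TALOSCTL and YQ are hypothetical override hooks; they default to the
# talosctl and yq binaries on PATH.
TALOSCTL="${TALOSCTL:-talosctl}"
YQ="${YQ:-yq}"

# Save the machine configuration of one node to <out_dir>/<node_ip>.yaml.
#   $1: IP of the node
#   $2: local directory for configuration backups
backup_machine_config() {
  node_ip="$1"
  out_dir="$2"
  out_file="$out_dir/$node_ip.yaml"
  mkdir -p "$out_dir" || return 1
  "$TALOSCTL" -n "$node_ip" get mc v1alpha1 -o yaml | "$YQ" eval '.spec' - > "$out_file"
  # The pipeline status only reflects yq, so also require a non-empty file.
  [ -s "$out_file" ]
}
```

The machine configuration contains the secret material of the cluster, so the output directory should be protected accordingly.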

Recovery

Before starting the disaster recovery procedure, make sure that the etcd cluster can’t be recovered:

  • get the etcd cluster member list on all healthy control plane nodes with the talosctl -n <IP> etcd members command and compare the lists across all members.
  • query the etcd service status on each control plane node with talosctl -n <IP> service etcd.

If quorum can be restored, doing so might be a better strategy than performing the full disaster recovery procedure.
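Both checks can be run against every control plane node in one pass so that the views are easy to compare. The helper name inspect_etcd and the TALOSCTL override variable below are hypothetical; the two talosctl commands are the ones listed above.

```shell
# TALOSCTL is a hypothetical override hook; it defaults to talosctl on PATH.
TALOSCTL="${TALOSCTL:-talosctl}"

# Print the etcd member list and the etcd service status as seen from each
# node, continuing past nodes that fail to answer.
#   $@: IPs of the control plane nodes
inspect_etcd() {
  for node_ip in "$@"; do
    echo "=== $node_ip: etcd members ==="
    "$TALOSCTL" -n "$node_ip" etcd members || echo "(etcd members failed on $node_ip)"
    echo "=== $node_ip: etcd service ==="
    "$TALOSCTL" -n "$node_ip" service etcd || echo "(service etcd failed on $node_ip)"
  done
}
```

If a majority of nodes still report the same member list and a running etcd service, quorum is likely still intact and full recovery is probably not needed.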

Latest Etcd Snapshot

Get hold of the latest etcd database snapshot. If a snapshot is not fresh enough, create a database snapshot (see above), even if the etcd cluster is unhealthy.

Init Node

Make sure that there are no control plane nodes with machine type init:

$ talosctl -n <IP1>,<IP2>,... get machinetype
NODE         NAMESPACE   TYPE          ID             VERSION   TYPE
172.20.0.2   config      MachineType   machine-type   2         controlplane
172.20.0.4   config      MachineType   machine-type   2         controlplane
172.20.0.3   config      MachineType   machine-type   2         controlplane

The init node type is deprecated and is incompatible with the etcd recovery procedure. An init node can be converted to the controlplane type with the talosctl edit mc --mode=staged command, followed by a node reboot with the talosctl reboot command.
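The machine type check can be automated by scanning the table output for a row whose last column is init. This is a sketch that assumes the table layout shown above (one header row, machine type in the last column); the function name has_init_node is hypothetical.

```shell
# Reads `talosctl get machinetype` table output on stdin.
# Exits 0 if any row reports machine type "init" in the last column,
# and 1 otherwise. The first line is treated as the header and skipped.
has_init_node() {
  awk 'NR > 1 && $NF == "init" { found = 1 } END { exit(found ? 0 : 1) }'
}
```

It could be used as `talosctl -n <IP1>,<IP2>,... get machinetype | has_init_node`, stopping the recovery procedure if the check succeeds.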

Preparing Control Plane Nodes

If some control plane nodes experienced hardware failure, replace them with new nodes.

Use the machine configuration backup to re-create the nodes with the same secret material and control plane settings, so that workers can join the recovered control plane.

If a control plane node is up but etcd isn’t, wipe the node’s EPHEMERAL partition to remove the etcd data directory (make sure a database snapshot is taken before doing this):

talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL

At this point, all control plane nodes should boot up, and the etcd service should be in the Preparing state.

The Kubernetes control plane endpoint should be pointed to the new control plane nodes if there were changes to the node addresses.

Recovering from the Backup

Make sure all etcd service instances are in the Preparing state:

$ talosctl -n <IP> service etcd
NODE     172.20.0.2
ID       etcd
STATE    Preparing
HEALTH   ?
EVENTS   [Preparing]: Running pre state (17s ago)
         [Waiting]: Waiting for service "cri" to be "up", time sync (18s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", time sync (20s ago)
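The state check can be scripted by extracting the STATE field from the output format shown above. This is a sketch, assuming that layout stays as shown; the helper names etcd_state and all_preparing and the TALOSCTL override variable are hypothetical.

```shell
# TALOSCTL is a hypothetical override hook; it defaults to talosctl on PATH.
TALOSCTL="${TALOSCTL:-talosctl}"

# Reads `talosctl service etcd` output on stdin and prints the STATE value.
etcd_state() {
  awk '$1 == "STATE" { print $2; exit }'
}

# Succeeds only if every listed node reports etcd in the Preparing state.
#   $@: IPs of the control plane nodes
all_preparing() {
  for node_ip in "$@"; do
    state=$("$TALOSCTL" -n "$node_ip" service etcd | etcd_state)
    if [ "$state" != "Preparing" ]; then
      echo "$node_ip: etcd state is '${state:-unknown}'" >&2
      return 1
    fi
  done
}
```

Running all_preparing with the IPs of all control plane nodes before bootstrapping gives a single pass or fail answer instead of reading each status by eye.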

Execute the bootstrap command against any control plane node, passing the path to the etcd database snapshot:

$ talosctl -n <IP> bootstrap --recover-from=./db.snapshot
recovering from snapshot "./db.snapshot": hash c25fd181, revision 4193, total keys 1287, total size 3035136

Note: if the database snapshot was copied out directly from the etcd data directory using talosctl cp, add the --recover-skip-hash-check flag to skip the integrity check on restore.
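The choice between the two forms of the bootstrap command can be made explicit in a wrapper. This is a sketch: the function name bootstrap_from_snapshot, the TALOSCTL override variable, and the "consistent" and "raw-copy" labels are hypothetical. The flags are the ones named in this guide.

```shell
# TALOSCTL is a hypothetical override hook; it defaults to talosctl on PATH.
TALOSCTL="${TALOSCTL:-talosctl}"

# Bootstrap etcd on one control plane node from a snapshot file.
#   $1: IP of the control plane node
#   $2: path to the snapshot file
#   $3: "consistent" for a file produced by `talosctl etcd snapshot`,
#       "raw-copy" for a db file copied out with `talosctl cp`
bootstrap_from_snapshot() {
  node_ip="$1"
  snapshot_path="$2"
  snapshot_kind="$3"
  case "$snapshot_kind" in
    consistent)
      "$TALOSCTL" -n "$node_ip" bootstrap --recover-from="$snapshot_path"
      ;;
    raw-copy)
      # A raw copy carries no integrity hash, so the check must be skipped.
      "$TALOSCTL" -n "$node_ip" bootstrap --recover-from="$snapshot_path" --recover-skip-hash-check
      ;;
    *)
      echo "unknown snapshot kind: $snapshot_kind" >&2
      return 2
      ;;
  esac
}
```

Requiring the caller to state the snapshot kind avoids skipping the hash check by accident on a regular snapshot, where the check is worth keeping.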

The Talos node should print matching information in the kernel log:

recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restoring snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/li}
{"level":"info","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":3360}
{"level":"info","msg":"added member","cluster-id":"a3390e43eb5274e2","local-member-id":"0","added-peer-id":"eb4f6f534361855e","added-peer-peer-urls":["https:/}
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}

The etcd service should now become healthy on the bootstrap node, the Kubernetes control plane components should start, and the control plane endpoint should become available. The remaining control plane nodes join the etcd cluster once the control plane endpoint is up.

Single Control Plane Node Cluster

This guide applies to single control plane node clusters as well. In fact, it is much more important to take regular snapshots of the etcd database in the single control plane node case, as the loss of the control plane node might render the whole cluster irrecoverable without a backup.

Last modified September 27, 2024: feat: prepare for Talos 1.9 (392c4798f)