Converting Control Plane
Talos version 0.9 runs the Kubernetes control plane in a new way: as static pods managed by Talos. Talos versions 0.8 and below run a self-hosted control plane. After Talos is upgraded to version 0.9, the Kubernetes control plane should be converted to run as static pods.
This guide describes the automated conversion script and also shows the detailed manual conversion process.
Automated Conversion
First, make sure all nodes are updated to Talos 0.9:
$ kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-default-master-1 Ready control-plane,master 58m v1.20.4 172.20.0.2 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
talos-default-master-2 Ready control-plane,master 58m v1.20.4 172.20.0.3 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
talos-default-master-3 Ready control-plane,master 58m v1.20.4 172.20.0.4 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
talos-default-worker-1 Ready <none> 58m v1.20.4 172.20.0.5 <none> Talos (v0.9.0) 5.10.19-talos containerd://1.4.4
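On larger clusters the same check can be scripted. The following is a minimal sketch, assuming the node list is fetched with kubectl's jsonpath output as name and OS image pairs separated by a tab; the function name `nodes_not_on_talos_09` is hypothetical and not part of Talos or kubectl.

```shell
#!/bin/sh
# Hypothetical helper: reads "<node name><TAB><OS image>" lines on stdin and
# prints the name of every node whose OS image is not a Talos v0.9.x release.
nodes_not_on_talos_09() {
    awk -F '\t' 'NF && $2 !~ /^Talos \(v0\.9\./ { print $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' \
#       | nodes_not_on_talos_09
```

Empty output means every listed node reports a Talos v0.9.x OS image.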
Start the conversion script:
$ talosctl -n <IP> convert-k8s
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
current self-hosted status: true
gathering control plane configuration
aggregator CA key can't be recovered from bootkube-boostrapped control plane, generating new CA
patching master node "172.20.0.2" configuration
patching master node "172.20.0.3" configuration
patching master node "172.20.0.4" configuration
waiting for static pod definitions to be generated
waiting for manifests to be generated
Talos generated control plane static pod definitions and bootstrap manifests, please verify them with commands:
talosctl -n <master node IP> get StaticPods.kubernetes.talos.dev
talosctl -n <master node IP> get Manifests.kubernetes.talos.dev
in order to remove self-hosted control plane, pod-checkpointer component needs to be disabled
once pod-checkpointer is disabled, the cluster shouldn't be rebooted until the entire conversion process is complete
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]:
The script stops at this point and waits for confirmation. Talos still runs the self-hosted control plane, and static pods have not been rendered yet.
As instructed by the script, please verify that static pod definitions are correct:
$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: StaticPods.kubernetes.talos.dev
    id: kube-apiserver
    version: 1
    phase: running
spec:
    apiVersion: v1
    kind: Pod
    metadata:
        annotations:
            talos.dev/config-version: "2"
            talos.dev/secrets-version: "1"
        creationTimestamp: null
        labels:
            k8s-app: kube-apiserver
            tier: control-plane
        name: kube-apiserver
        namespace: kube-system
    spec:
        containers:
            - command:
...
Static pod definitions are generated from the machine configuration. They should match the pod templates generated by Talos when the self-hosted control plane was bootstrapped, unless manual changes were applied to the daemonset specs after bootstrap. Talos patches the machine configuration with the container image versions scraped from the daemonset definitions and fetches the service account key from Kubernetes secrets.
The aggregator CA can’t be recovered from the self-hosted control plane, so a new CA is generated. This is generally harmless and not visible from outside the cluster. The aggregator CA is not the same CA as the one used by Talos or by the standard Kubernetes API. It is a special PKI used for aggregating API extension services inside your cluster. If you have non-standard apiserver aggregations (fairly rare, and you should know if you do), you may need to restart these services after the new CA is in place.
Verify that bootstrap manifests are correct:
$ talosctl -n <IP> get manifests
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 controlplane Manifest 00-kubelet-bootstrapping-token 1
172.20.0.2 controlplane Manifest 01-csr-approver-role-binding 1
172.20.0.2 controlplane Manifest 01-csr-node-bootstrap 1
172.20.0.2 controlplane Manifest 01-csr-renewal-role-binding 1
172.20.0.2 controlplane Manifest 02-kube-system-sa-role-binding 1
172.20.0.2 controlplane Manifest 03-default-pod-security-policy 1
172.20.0.2 controlplane Manifest 05-https://docs.projectcalico.org/manifests/calico.yaml 1
172.20.0.2 controlplane Manifest 10-kube-proxy 1
172.20.0.2 controlplane Manifest 11-core-dns 1
172.20.0.2 controlplane Manifest 11-core-dns-svc 1
172.20.0.2 controlplane Manifest 11-kube-config-in-cluster 1
Make sure that the manifests and static pods are correct across all control plane nodes, as each node reconciles control plane state on its own. For example, the CNI configuration in the machine config should be in sync across all the nodes. Talos nodes try to create any missing Kubernetes resources from the manifests, but they never update or delete existing resources.
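One way to spot a node that is out of sync is to compare the manifest IDs reported by each control plane node. The sketch below assumes the table format shown above (columns NODE, NAMESPACE, TYPE, ID, VERSION) and that the output for several nodes is concatenated on stdin; the function name `manifests_missing` is hypothetical. It compares only IDs, not manifest contents, so it is a first check rather than full verification.

```shell
#!/bin/sh
# Hypothetical helper: reads the table printed by `talosctl get manifests`
# for several nodes on stdin and prints "<id> missing on <node>" for every
# manifest ID that at least one node has but another node lacks.
manifests_missing() {
    awk '
        $1 == "NODE" { next }    # skip header rows
        NF >= 5 { nodes[$1] = 1; ids[$4] = 1; seen[$1, $4] = 1 }
        END {
            for (id in ids)
                for (node in nodes)
                    if (!((node, id) in seen))
                        print id " missing on " node
        }
    ' | sort
}

# Usage (not run here), passing all control plane IPs in one call:
#   talosctl -n <IP1>,<IP2>,<IP3> get manifests | manifests_missing
```

Empty output means every node reports the same set of manifest IDs.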
If something looks wrong, the script can be aborted and the machine configuration updated to fix the problem. Once the configuration is updated, the script can be restarted.
If the static pod definitions and manifests look good, confirm the next step to disable pod-checkpointer:
$ talosctl -n <IP> convert-k8s
...
confirm disabling pod-checkpointer to proceed with control plane update [yes/no]: yes
disabling pod-checkpointer
deleting daemonset "pod-checkpointer"
checking for active pod checkpoints
2021/03/09 23:37:25 retrying error: found 3 active pod checkpoints: [pod-checkpointer-655gc-talos-default-master-3 pod-checkpointer-pw6mv-talos-default-master-1 pod-checkpointer-zdw9z-talos-default-master-2]
2021/03/09 23:42:25 retrying error: found 1 active pod checkpoints: [pod-checkpointer-pw6mv-talos-default-master-1]
confirm applying static pod definitions and manifests [yes/no]:
The self-hosted control plane runs pod-checkpointer to work around issues with control plane availability. It must be disabled before the conversion starts so that the self-hosted control plane can be removed. It takes around 5 minutes for pod-checkpointer to be fully disabled. The script verifies that all checkpoints are removed before proceeding.
This last confirmation comes at the point after which the self-hosted control plane can no longer be kept running: once confirmed, static pods are released, bootstrap manifests are applied, and the self-hosted control plane is removed.
$ talosctl -n <IP> convert-k8s
...
confirm applying static pod definitions and manifests [yes/no]: yes
removing self-hosted initialized key
waiting for static pods for "kube-apiserver" to be present in the API server state
waiting for static pods for "kube-controller-manager" to be present in the API server state
waiting for static pods for "kube-scheduler" to be present in the API server state
deleting daemonset "kube-apiserver"
waiting for static pods for "kube-apiserver" to be present in the API server state
deleting daemonset "kube-controller-manager"
waiting for static pods for "kube-controller-manager" to be present in the API server state
deleting daemonset "kube-scheduler"
waiting for static pods for "kube-scheduler" to be present in the API server state
conversion process completed successfully
As soon as the control plane static pods are rendered, the kubelet starts them. It is expected that the pods for kube-apiserver will crash initially, because only one kube-apiserver can be bound to the host Node’s port 6443 at a time. Eventually, the old kube-apiserver will be killed, and the new one will be able to start. This is all handled automatically.
The script will continue by removing each self-hosted daemonset and verifying that static pods are ready and healthy.
Manual Conversion
Check that Talos runs a self-hosted control plane:
$ talosctl -n <CONTROL_PLANE_IP> get bs
NODE NAMESPACE TYPE ID VERSION SELF HOSTED
172.20.0.2 runtime BootstrapStatus control-plane 2 true
The Talos machine configuration needs to be updated to the 0.9 format; there are two new required machine configuration settings:
.cluster.serviceAccount is the PEM-encoded service account private key.
.cluster.aggregatorCA is the aggregator CA for kube-apiserver (certificate and private key).
The current service account key can be fetched from the Kubernetes secrets:
$ kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}'
LS0tLS1CRUdJTiBSU0EgUFJJVkFURS...
All control plane node machine configurations should be patched with the service account key:
$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/serviceAccount", "value": {"key": "LS0tLS1CRUdJTiBSU0EgUFJJVkFURS..."}}]'
patched mc at the node 172.20.0.2
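To avoid pasting the key by hand, the patch can be assembled from the fetched value. This is a minimal sketch; the function name `service_account_patch` is hypothetical, while the patch structure and the commands in the usage comment are the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper: prints the JSON patch that adds the base64-encoded
# service account key (first argument) at /cluster/serviceAccount.
service_account_patch() {
    printf '[{"op": "add", "path": "/cluster/serviceAccount", "value": {"key": "%s"}}]\n' "$1"
}

# Usage (not run here):
#   key=$(kubectl -n kube-system get secrets kube-controller-manager -o jsonpath='{.data.service\-account\.key}')
#   talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p "$(service_account_patch "$key")"
```

Base64 output contains no characters that need escaping in JSON, so plain string substitution is sufficient here.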
The aggregator CA can be generated using OpenSSL or any other certificate generation tool: an RSA or ECDSA certificate with CN front-proxy, valid for 10 years. The PEM-encoded CA certificate and key should be base64-encoded and patched into the machine config at path /cluster/aggregatorCA:
$ talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p '[{"op": "add", "path": "/cluster/aggregatorCA", "value": {"crt": "LS0tLS1CRUdJTiBDRVJUSUZJQ...", "key": "LS0tLS1CRUdJTiBFQy..."}}]'
patched mc at the node 172.20.0.2
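The CA generation and encoding steps could look like the following with OpenSSL. This is a sketch under stated assumptions, not the method Talos itself uses: it picks an ECDSA P-256 key, relies on the default OpenSSL configuration to mark the self-signed certificate as a CA, and uses hypothetical function names (`generate_aggregator_ca`, `b64_file`, `aggregator_ca_patch`).

```shell
#!/bin/sh
# Hypothetical helper: writes ca.key and ca.crt into the directory given as
# the first argument. ECDSA P-256 key, self-signed certificate with
# CN=front-proxy, valid for 3650 days (about 10 years).
generate_aggregator_ca() {
    openssl ecparam -name prime256v1 -genkey -noout -out "$1/ca.key" &&
    openssl req -x509 -new -key "$1/ca.key" -days 3650 \
        -subj "/CN=front-proxy" -out "$1/ca.crt"
}

# Base64-encodes a file onto a single line.
b64_file() {
    base64 < "$1" | tr -d '\n'
}

# Prints the JSON patch for /cluster/aggregatorCA, built from the files in
# the directory given as the first argument.
aggregator_ca_patch() {
    printf '[{"op": "add", "path": "/cluster/aggregatorCA", "value": {"crt": "%s", "key": "%s"}}]\n' \
        "$(b64_file "$1/ca.crt")" "$(b64_file "$1/ca.key")"
}

# Usage (not run here):
#   dir=$(mktemp -d)
#   generate_aggregator_ca "$dir"
#   talosctl -n <CONTROL_PLANE_IP1>,<CONTROL_PLANE_IP2>,... patch mc --immediate -p "$(aggregator_ca_patch "$dir")"
```

The generated private key should be treated as a secret and removed from the temporary directory once the machine configuration is patched.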
At this point static pod definitions and bootstrap manifests should be rendered, please see “Automated Conversion” on how to verify generated objects. Feel free to continue to refine your machine configuration until the generated static pod definitions and bootstrap manifests look good.
If static pod definitions are not generated, check the logs with talosctl -n <IP> logs controller-runtime.
Disable pod-checkpointer with:
$ kubectl -n kube-system delete ds pod-checkpointer
daemonset.apps "pod-checkpointer" deleted
Wait for all pod checkpoints to be removed:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
...
pod-checkpointer-8q2lh-talos-default-master-2 1/1 Running 0 3m34s
pod-checkpointer-nnm5w-talos-default-master-3 1/1 Running 0 3m24s
pod-checkpointer-qnmdt-talos-default-master-1 1/1 Running 0 2m21s
Pod checkpoints have the annotation checkpointer.alpha.coreos.com/checkpoint-of.
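The wait can be scripted by polling for pods that carry this annotation. The sketch below is an assumption-laden illustration: both function names are hypothetical, and the jsonpath expression in the usage comment (in particular the escaping of the annotation key) has not been verified against a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: reads "<pod name><TAB><checkpoint-of annotation>" lines
# on stdin and prints the pods whose annotation is set, i.e. pod checkpoints.
active_checkpoints() {
    awk -F '\t' '$2 != "" { print $1 }'
}

# Hypothetical helper: re-runs a command until it prints nothing.
# Arguments: timeout in seconds, polling interval in seconds, command...
# Returns 0 once the output is empty, 1 if the timeout is reached first.
wait_until_empty() {
    timeout_s="$1"; interval_s="$2"; shift 2
    waited=0
    while [ -n "$("$@")" ]; do
        [ "$waited" -ge "$timeout_s" ] && return 1
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    return 0
}

# Usage (not run here; the jsonpath escaping is an untested assumption):
#   list_checkpoints() {
#       kubectl -n kube-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.checkpointer\.alpha\.coreos\.com/checkpoint\-of}{"\n"}{end}' \
#           | active_checkpoints
#   }
#   wait_until_empty 600 10 list_checkpoints
```

The 600-second timeout in the usage comment is an arbitrary margin over the roughly 5 minutes the removal is expected to take.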
Once all the pod checkpoints are removed (this takes around 5 minutes), proceed by removing the self-hosted initialized key:
$ talosctl -n <CONTROL_PLANE_IP> convert-k8s --remove-initialized-key
Talos controllers will now render static pod definitions, and the kubelet will launch any resulting static pods.
Once the static pods are visible in the kubectl get pods -n kube-system output, proceed by removing each of the self-hosted daemonsets:
$ kubectl -n kube-system delete daemonset kube-apiserver
daemonset.apps "kube-apiserver" deleted
Make sure the static pods for kube-apiserver started successfully and that the pods are running and ready.
Proceed by deleting the kube-controller-manager and kube-scheduler daemonsets, verifying that the static pods are running between each step:
$ kubectl -n kube-system delete daemonset kube-controller-manager
daemonset.apps "kube-controller-manager" deleted
$ kubectl -n kube-system delete daemonset kube-scheduler
daemonset.apps "kube-scheduler" deleted
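The readiness check between steps can be done by filtering the pod list. A minimal sketch, assuming the default table output of kubectl get pods (columns NAME, READY, STATUS, RESTARTS, AGE) and that the pods of a component share its name as a prefix; the function name `not_ready_pods` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl -n kube-system get pods` table output on
# stdin and prints every pod of the given component (first argument, matched
# as a name prefix) that is not Running with all of its containers ready.
not_ready_pods() {
    awk -v prefix="$1-" '
        NR == 1 && $1 == "NAME" { next }    # skip the header row
        index($1, prefix) == 1 {
            split($2, ready, "/")           # READY column, e.g. "1/1"
            if ($3 != "Running" || ready[1] != ready[2]) print $1
        }
    '
}

# Usage (not run here):
#   kubectl -n kube-system get pods | not_ready_pods kube-apiserver
```

Empty output only shows that no matching pod is unhealthy; it does not prove that any matching pod exists, so the pod list itself should still be inspected.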