Advanced Guides
- 1: Advanced Networking
- 2: Air-gapped Environments
- 3: Building Custom Talos Images
- 4: CA Rotation
- 5: Customizing the Kernel
- 6: Developing Talos
- 7: Disaster Recovery
- 8: Egress Domains
- 9: etcd Maintenance
- 10: Extension Services
- 11: Machine Configuration OAuth2 Authentication
- 12: Metal Network Configuration
- 13: Migrating from Kubeadm
- 14: Overlays
- 15: Proprietary Kernel Modules
- 16: Static Pods
- 17: Talos API access from Kubernetes
- 18: Verifying Images
- 19: Watchdog Timers
1 - Advanced Networking
Static Addressing
Static addressing consists of specifying addresses, routes (remember to add your default gateway), and interface.
Most likely you’ll also want to define the nameservers so you have properly functioning DNS.
machine:
  network:
    hostname: talos
    nameservers:
      - 10.0.0.1
    interfaces:
      - interface: eth0
        addresses:
          - 10.0.0.201/8
        mtu: 8765
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
      - interface: eth1
        ignore: true
  time:
    servers:
      - time.cloudflare.com
Additional Addresses for an Interface
In some environments you may need to set additional addresses on an interface. In the following example, we set two additional addresses on the loopback interface.
machine:
  network:
    interfaces:
      - interface: lo
        addresses:
          - 192.168.0.21/24
          - 10.2.2.2/24
Bonding
The following example shows how to create a bonded interface.
machine:
  network:
    interfaces:
      - interface: bond0
        dhcp: true
        bond:
          mode: 802.3ad
          lacpRate: fast
          xmitHashPolicy: layer3+4
          miimon: 100
          updelay: 200
          downdelay: 200
          interfaces:
            - eth0
            - eth1
Setting Up a Bridge
The following example shows how to set up a bridge between two interfaces with an assigned static address.
machine:
  network:
    interfaces:
      - interface: br0
        addresses:
          - 192.168.0.42/24
        bridge:
          stp:
            enabled: true
          interfaces:
            - eth0
            - eth1
VLANs
To set up VLANs on a specific device, use an array of VLANs to add. The master device may be configured without addressing by setting dhcp to false.
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        vlans:
          - vlanId: 100
            addresses:
              - "192.168.2.10/28"
            routes:
              - network: 0.0.0.0/0
                gateway: 192.168.2.1
2 - Air-gapped Environments
In this guide we will create a Talos cluster running in an air-gapped environment with all the required images being pulled from an internal registry.
We will use the QEMU provisioner available in talosctl to create a local cluster, but the same approach could be used to deploy Talos in bigger air-gapped networks.
Requirements
The following are requirements for this guide:
- Docker 18.03 or greater
- Requirements for the Talos QEMU cluster
Identifying Images
In air-gapped environments, access to the public Internet is restricted, so Talos can’t pull images from public Docker registries (docker.io, ghcr.io, etc.).
We need to identify the images required to install and run Talos.
The same strategy can be used for images required by custom workloads running on the cluster.
The talosctl image default command provides a list of default images used by the Talos cluster (with default configuration settings).
To print the list of images, run:
talosctl image default
This list contains images required by a default deployment of Talos. There might be additional images required for the workloads running on this cluster, and those should be added to this list.
Preparing the Internal Registry
As access to the public registries is restricted, we have to run an internal Docker registry. In this guide, we will launch the registry on the same machine using Docker:
$ docker run -d -p 6000:5000 --restart always --name registry-airgapped registry:2
1bf09802bee1476bc463d972c686f90a64640d87dacce1ac8485585de69c91a5
This registry will be accepting connections on port 6000 on the host IPs. The registry is empty by default, so we have to fill it with the images required by Talos.
First, we pull all the images to our local Docker daemon:
$ for image in `talosctl image default`; do docker pull $image; done
v0.15.1: Pulling from coreos/flannel
Digest: sha256:9a296fbb67790659adc3701e287adde3c59803b7fcefe354f1fc482840cdb3d9
...
All images are now stored in the Docker daemon store:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
gcr.io/etcd-development/etcd v3.5.3 604d4f022632 6 days ago 181MB
ghcr.io/siderolabs/install-cni v1.0.0-2-gc5d3ab0 4729e54f794d 6 days ago 76MB
...
Now we need to re-tag them so that we can push them to our local registry.
We are going to replace the first component of the image name (before the first slash) with our registry endpoint 127.0.0.1:6000:
$ for image in `talosctl image default`; do \
docker tag $image `echo $image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
done
As the next step, we push images to the internal registry:
$ for image in `talosctl image default`; do \
docker push `echo $image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
done
We can now verify that the images are pushed to the registry:
$ curl http://127.0.0.1:6000/v2/_catalog
{"repositories":["coredns/coredns","coreos/flannel","etcd-development/etcd","kube-apiserver","kube-controller-manager","kube-proxy","kube-scheduler","pause","siderolabs/install-cni","siderolabs/installer","siderolabs/kubelet"]}
Note: images in the registry don’t have the registry endpoint prefix anymore.
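The re-tagging rule used in the loops above can be checked in isolation before touching the Docker daemon. The following is a minimal sketch: the function name retag_ref and the REGISTRY_ENDPOINT variable are illustrative and not part of talosctl or Docker, and the sed expression is the one from the loops above with the endpoint parameterized.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite an image reference so that its first component
# (the registry host) is replaced with the internal registry endpoint.
REGISTRY_ENDPOINT="${REGISTRY_ENDPOINT:-127.0.0.1:6000}"

retag_ref() {
  # Replace everything up to and including the first slash.
  printf '%s\n' "$1" | sed -E "s#^[^/]+/#${REGISTRY_ENDPOINT}/#"
}

# Example: print the old and new reference for two of the images listed above.
for image in gcr.io/etcd-development/etcd:v3.5.3 \
             ghcr.io/siderolabs/install-cni:v1.0.0-2-gc5d3ab0; do
  printf '%s -> %s\n' "$image" "$(retag_ref "$image")"
done
```

An image reference that contains no slash is not matched by the expression and passes through unchanged, so it is worth reviewing the mapping for such entries before pushing.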
Launching Talos in an Air-gapped Environment
For Talos to use the internal registry, we use the registry mirror feature to redirect all image pull requests to the internal registry. This means that the registry endpoint (as the first component of the image reference) gets ignored, and all pull requests are sent directly to the specified endpoint.
We are going to use a QEMU-based Talos cluster for this guide, but the same approach works with Docker-based clusters as well. As QEMU-based clusters go through the Talos install process, they model a real air-gapped environment more closely.
Identify all registry prefixes from talosctl image default, for example:
docker.io
gcr.io
ghcr.io
registry.k8s.io
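The prefixes can also be derived from the image list rather than read off by eye. This is a small sketch under the assumption that every image reference starts with a registry host followed by a slash; registry_prefixes is a hypothetical helper name, and the sample references use a placeholder tag rather than a real release list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read image references on stdin and print the unique
# registry hosts (the part before the first slash), sorted.
registry_prefixes() {
  cut -d/ -f1 | LC_ALL=C sort -u
}

# Intended use (requires talosctl):
#   talosctl image default | registry_prefixes
#
# Demonstration with a fixed sample list; TAG is a placeholder.
printf '%s\n' \
  ghcr.io/siderolabs/installer:TAG \
  gcr.io/etcd-development/etcd:TAG \
  registry.k8s.io/kube-apiserver:TAG \
  docker.io/coredns/coredns:TAG \
  ghcr.io/siderolabs/kubelet:TAG \
  | registry_prefixes
```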
The talosctl cluster create command provides conveniences for common configuration options.
The only required flag for this guide is --registry-mirror <endpoint>=http://10.5.0.1:6000, which redirects every pull request to the internal registry. This flag needs to be repeated for each of the registry prefixes identified above.
The endpoint being used is 10.5.0.1, as this is the default bridge interface address, which is routable from the QEMU VMs (the 127.0.0.1 IP would point to the VM itself).
$ sudo --preserve-env=HOME talosctl cluster create --provisioner=qemu --install-image=ghcr.io/siderolabs/installer:v1.8.0 \
--registry-mirror docker.io=http://10.5.0.1:6000 \
--registry-mirror gcr.io=http://10.5.0.1:6000 \
--registry-mirror ghcr.io=http://10.5.0.1:6000 \
--registry-mirror registry.k8s.io=http://10.5.0.1:6000
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/home/user/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
waiting for API
...
Note: --install-image should match the image which was copied into the internal registry in the previous step.
You can verify that the cluster is air-gapped by inspecting the registry logs: docker logs -f registry-airgapped.
Closing Notes
Running in an air-gapped environment might require additional configuration changes, for example using custom settings for DNS and NTP servers.
When scaling this guide to a bare-metal environment, the following Talos config snippet could be used as an equivalent of the --registry-mirror flag above:
machine:
  ...
  registries:
    mirrors:
      docker.io:
        endpoints:
          - http://10.5.0.1:6000/
      gcr.io:
        endpoints:
          - http://10.5.0.1:6000/
      ghcr.io:
        endpoints:
          - http://10.5.0.1:6000/
      registry.k8s.io:
        endpoints:
          - http://10.5.0.1:6000/
...
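When the list of registries is long, the mirrors section can be generated instead of written by hand. The sketch below prints only the machine.registries.mirrors part of the snippet above; mirror_patch is a hypothetical helper name, and it assumes that all registries are redirected to the same endpoint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a machine config fragment that mirrors each
# registry given as an argument to a single internal endpoint.
mirror_patch() {
  local endpoint="$1" registry
  shift
  printf 'machine:\n  registries:\n    mirrors:\n'
  for registry in "$@"; do
    printf '      %s:\n        endpoints:\n          - %s\n' "$registry" "$endpoint"
  done
}

# Example: the four registries used in this guide.
mirror_patch http://10.5.0.1:6000/ docker.io gcr.io ghcr.io registry.k8s.io
```

The output can be redirected to a file and reviewed before it is merged into the machine configuration.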
Other implementations of a Docker registry can be used in place of the Docker registry image used above to run the registry.
If required, auth can be configured for the internal registry (and custom TLS certificates if needed).
See the pull-through cache guide for an example using the Harbor container registry with Talos.
3 - Building Custom Talos Images
There might be several reasons to build Talos images from source:
- verifying the image integrity
- building an image with custom configuration
Checkout Talos Source
git clone https://github.com/siderolabs/talos.git
If building for a specific release, check out the corresponding tag:
git checkout v1.8.0
Set up the Build Environment
See Developing Talos for details on setting up the buildkit builder.
Architectures
By default, Talos builds for linux/amd64, but you can customize that by passing the PLATFORM variable to make:
make <target> PLATFORM=linux/arm64 # build for arm64 only
make <target> PLATFORM=linux/arm64,linux/amd64 # build for arm64 and amd64, container images will be multi-arch
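To build for the architecture of the current host, the PLATFORM value can be derived from the machine name reported by uname -m. This is a sketch only: platform_for is a hypothetical helper, and the mapping covers just the two architectures mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a machine name (as printed by `uname -m`) to a
# PLATFORM value for make.
platform_for() {
  case "$1" in
    x86_64|amd64)  echo linux/amd64 ;;
    aarch64|arm64) echo linux/arm64 ;;
    *) echo "unsupported machine type: $1" >&2; return 1 ;;
  esac
}

# Example: print the make invocation for this host (<target> stays a placeholder).
if platform="$(platform_for "$(uname -m)")"; then
  echo "make <target> PLATFORM=${platform}"
fi
```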
Custom PKGS
When customizing the Linux kernel, the source for the siderolabs/pkgs repository can be overridden as follows:
- if you built and pushed only a custom kernel package, the reference can be overridden with the PKG_KERNEL variable: make <target> PKG_KERNEL=<registry>/<username>/kernel:<tag>
- if any other single package was customized, the reference can be overridden with the PKG_<pkg> (e.g. PKG_IPTABLES) variable: make <target> PKG_<pkg>=<registry>/<username>/<pkg>:<tag>
- if the full pkgs repository was built and pushed, the references can be overridden with the PKGS_PREFIX and PKGS variables: make <target> PKGS_PREFIX=<registry>/<username> PKGS=<tag>
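The single-package override can be assembled with a small helper. This is a sketch under an assumption that is not stated above: the variable name is taken to be the package name in upper case with hyphens replaced by underscores, which matches the PKG_KERNEL and PKG_IPTABLES examples but should be checked against the Makefile for other packages. The helper name pkg_override and the registry.example.com references are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the make argument that overrides a single
# package reference.
# Assumption: variable name = "PKG_" + upper-cased package name, '-' -> '_'.
pkg_override() {
  local pkg="$1" ref="$2" var
  var="PKG_$(printf '%s' "$pkg" | LC_ALL=C tr 'a-z-' 'A-Z_')"
  printf '%s=%s\n' "$var" "$ref"
}

# Example with placeholder image references.
pkg_override kernel   registry.example.com/user/kernel:dev
pkg_override iptables registry.example.com/user/iptables:dev
```

The result is meant to be passed as an extra argument, for example make <target> "$(pkg_override kernel <registry>/<username>/kernel:<tag>)".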
Customizations
Some of the build parameters can be customized by passing environment variables to make, e.g. GOAMD64=v1 can be used to build Talos images compatible with old AMD64 CPUs:
make <target> GOAMD64=v1
Building Kernel and Initramfs
The most basic boot assets can be built with:
make kernel initramfs
The build result will be stored as _out/vmlinuz-<arch> and _out/initramfs-<arch>.xz.
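A quick check that both boot assets were produced can save a failed boot later. The following sketch only tests for the two file names given above; check_boot_assets is a hypothetical helper and not a target in the Talos Makefile.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that the kernel and initramfs for a given
# architecture exist in the output directory (default: _out).
check_boot_assets() {
  local arch="$1" dir="${2:-_out}" missing=0 f
  for f in "$dir/vmlinuz-$arch" "$dir/initramfs-$arch.xz"; do
    if [ -f "$f" ]; then
      echo "found:   $f"
    else
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Intended use after `make kernel initramfs`:
#   check_boot_assets amd64
```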
Building Container Images
Talos container images should be pushed to the registry as the result of the build process.
The default settings are:
- IMAGE_REGISTRY is set to ghcr.io
- USERNAME is set to siderolabs (or the value of the environment variable USERNAME if it is set)
The image can be pushed to any registry you have access to, but the access credentials should be stored in the ~/.docker/config.json file (e.g. with docker login).
Building and pushing the image can be done with:
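make installer PUSH=true IMAGE_REGISTRY=docker.io USERNAME=<username> # pushes docker.io/<username>/installer instead of the default ghcr.io/siderolabs/installer
make imager PUSH=true IMAGE_REGISTRY=docker.io USERNAME=<username> # pushes docker.io/<username>/imager instead of the default ghcr.io/siderolabs/imager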
The local registry running on 127.0.0.1:5005 can be used as well to avoid pushing/pulling over the network:
make installer PUSH=true REGISTRY=127.0.0.1:5005
When building the imager container, by default Talos will include the boot assets for both amd64 and arm64 architectures. If building only for a single architecture, specify the INSTALLER_ARCH variable:
make imager INSTALLER_ARCH=targetarch PLATFORM=linux/amd64
Building ISO
The ISO image is built with the help of the imager container image; by default, ghcr.io/siderolabs/imager will be used with the matching tag:
make iso
The ISO image will be stored as _out/talos-<arch>.iso.
If the ISO image should be built with a custom imager image, it can be specified with the IMAGE_REGISTRY/USERNAME variables:
make iso IMAGE_REGISTRY=docker.io USERNAME=<username>
Building Disk Images
The disk image is built with the help of the imager container image; by default, ghcr.io/siderolabs/imager will be used with the matching tag:
make image-metal
Available disk images are encoded in the image-% target, e.g. make image-aws.
As with the ISO image, a custom imager image can be specified with the IMAGE_REGISTRY/USERNAME variables.
4 - CA Rotation
In general, you almost never need to rotate the root CA certificate and key for the Talos API and Kubernetes API. Talos sets up root certificate authorities with the lifetime of 10 years, and all Talos and Kubernetes API certificates are issued by these root CAs. So the rotation of the root CA is only needed if:
- you suspect that the private key has been compromised;
- you want to revoke access to the cluster for a leaked talosconfig or kubeconfig;
- once in 10 years.
Overview
There are some details which make Talos and Kubernetes API root CA rotation a bit different, but the general flow is the same:
- generate new CA certificate and key;
- add new CA certificate as ‘accepted’, so new certificates will be accepted as valid;
- swap the issuing CA to the new one, keeping the old CA as accepted;
- refresh all certificates in the cluster;
- remove old CA from ‘accepted’.
At the end of the flow, old CA is completely removed from the cluster, so all certificates issued by it will be considered invalid.
Both rotation flows are described in detail below.
Talos API
Automated Talos API CA Rotation
Talos API CA rotation doesn’t interrupt connections within the cluster, and it doesn’t require a reboot of the nodes.
Run the following command in dry-run mode to see the steps which will be taken:
$ talosctl -n <CONTROLPLANE> rotate-ca --dry-run=true --talos=true --kubernetes=false
> Starting Talos API PKI rotation, dry-run mode true...
> Using config context: "talos-default"
> Using Talos API endpoints: ["172.20.0.2"]
> Cluster topology:
- control plane nodes: ["172.20.0.2"]
- worker nodes: ["172.20.0.3"]
> Current Talos CA:
...
No changes will be done to the cluster in dry-run mode, so you can safely run it to see the steps.
Before proceeding, make sure that you can capture the output of the talosctl command, as it will contain the new CA certificate and key.
Record a list of Talos API users to make sure they can all be updated with the new talosconfig.
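One way to capture the output is to pipe it through tee into a file that only the current user can read, since the output includes the new CA key. This is a minimal sketch: run_and_capture is a hypothetical wrapper, not a talosctl feature, and it assumes bash (for pipefail).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, show its output, and keep a private
# copy in a log file.
run_and_capture() {
  local log="$1"
  shift
  (
    umask 077        # a newly created log file is readable by the owner only
    set -o pipefail  # propagate the command's exit status through tee
    "$@" 2>&1 | tee "$log"
  )
}

# Intended use (placeholder node address):
#   run_and_capture rotate-ca.log talosctl -n <CONTROLPLANE> rotate-ca --dry-run=true --talos=true --kubernetes=false
```

The umask only affects files created by the wrapper; if the log file already exists, its permissions are left as they are.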
Run the following command to rotate the Talos API CA:
$ talosctl -n <CONTROLPLANE> rotate-ca --dry-run=false --talos=true --kubernetes=false
> Starting Talos API PKI rotation, dry-run mode false...
> Using config context: "talos-default-268"
> Using Talos API endpoints: ["172.20.0.2"]
> Cluster topology:
- control plane nodes: ["172.20.0.2"]
- worker nodes: ["172.20.0.3"]
> Current Talos CA:
...
> New Talos CA:
...
> Generating new talosconfig:
context: talos-default
contexts:
talos-default:
....
> Verifying connectivity with existing PKI:
- 172.20.0.2: OK (version v1.8.0)
- 172.20.0.3: OK (version v1.8.0)
> Adding new Talos CA as accepted...
- 172.20.0.2: OK
- 172.20.0.3: OK
> Verifying connectivity with new client cert, but old server CA:
2024/04/17 21:26:07 retrying error: rpc error: code = Unavailable desc = connection error: desc = "error reading server preface: remote error: tls: unknown certificate authority"
- 172.20.0.2: OK (version v1.8.0)
- 172.20.0.3: OK (version v1.8.0)
> Making new Talos CA the issuing CA, old Talos CA the accepted CA...
- 172.20.0.2: OK
- 172.20.0.3: OK
> Verifying connectivity with new PKI:
2024/04/17 21:26:08 retrying error: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"x509: Ed25519 verification failure\" while trying to verify candidate authority certificate \"talos\")"
- 172.20.0.2: OK (version v1.8.0)
- 172.20.0.3: OK (version v1.8.0)
> Removing old Talos CA from the accepted CAs...
- 172.20.0.2: OK
- 172.20.0.3: OK
> Verifying connectivity with new PKI:
- 172.20.0.2: OK (version v1.8.0)
- 172.20.0.3: OK (version v1.8.0)
> Writing new talosconfig to "talosconfig"
Once the rotation is done, stash the new Talos CA, and update secrets.yaml (if using that for machine configuration generation) with the new CA key and certificate.
The new client talosconfig is written to the current directory as talosconfig.
You can merge it to the default location with talosctl config merge ./talosconfig.
If other client access talosconfig files need to be generated, use talosctl config new with the new talosconfig.
Note: if using the Talos API access from Kubernetes feature, pods might need to be restarted manually to pick up the new talosconfig.
Manual Steps for Talos API CA Rotation
- Generate a new Talos CA (e.g. run talosctl gen secrets and use the Talos CA it generates).
- Patch the machine configuration on all nodes, updating .machine.acceptedCAs with the new CA certificate.
- Generate a talosconfig with a client certificate issued by the new CA, but still using the old CA as the server CA, and verify connectivity; Talos should accept the new client certificate.
- Patch the machine configuration on all nodes, updating .machine.ca with the new CA certificate and key, and keeping the old CA certificate in .machine.acceptedCAs (on worker nodes .machine.ca doesn’t have the key).
- Generate a talosconfig with both the client certificate and the server CA using the new CA PKI, and verify connectivity.
- Remove the old CA certificate from .machine.acceptedCAs on all nodes.
- Verify connectivity.
Kubernetes API
Automated Kubernetes API CA Rotation
The automated process only rotates the Kubernetes API CA, used by the kube-apiserver, kubelet, etc.
Other Kubernetes secrets might need to be rotated manually as required.
Kubernetes pods might need to be restarted to handle changes, and communication within the cluster might be disrupted during the rotation process.
Run the following command in dry-run mode to see the steps which will be taken:
$ talosctl -n <CONTROLPLANE> rotate-ca --dry-run=true --talos=false --kubernetes=true
> Starting Kubernetes API PKI rotation, dry-run mode true...
> Cluster topology:
- control plane nodes: ["172.20.0.2"]
- worker nodes: ["172.20.0.3"]
> Building current Kubernetes client...
> Current Kubernetes CA:
...
Before proceeding, make sure that you can capture the output of the talosctl command, as it will contain the new CA certificate and key.
As Talos API access will not be disrupted, the changes can be reverted if needed by reverting the machine configuration.
Run the following command to rotate the Kubernetes API CA:
$ talosctl -n <CONTROLPLANE> rotate-ca --dry-run=false --talos=false --kubernetes=true
> Starting Kubernetes API PKI rotation, dry-run mode false...
> Cluster topology:
- control plane nodes: ["172.20.0.2"]
- worker nodes: ["172.20.0.3"]
> Building current Kubernetes client...
> Current Kubernetes CA:
...
> New Kubernetes CA:
...
> Verifying connectivity with existing PKI...
- OK (2 nodes ready)
> Adding new Kubernetes CA as accepted...
- 172.20.0.2: OK
- 172.20.0.3: OK
> Making new Kubernetes CA the issuing CA, old Kubernetes CA the accepted CA...
- 172.20.0.2: OK
- 172.20.0.3: OK
> Building new Kubernetes client...
> Verifying connectivity with new PKI...
2024/04/17 21:45:52 retrying error: Get "https://172.20.0.1:6443/api/v1/nodes": EOF
- OK (2 nodes ready)
> Removing old Kubernetes CA from the accepted CAs...
- 172.20.0.2: OK
- 172.20.0.3: OK
> Verifying connectivity with new PKI...
- OK (2 nodes ready)
> Kubernetes CA rotation done, new 'kubeconfig' can be fetched with `talosctl kubeconfig`.
At the end of the process, Kubernetes control plane components will be restarted to pick up CA certificate changes.
Each node kubelet will re-join the cluster with a new client certificate.
A new kubeconfig can be fetched with the talosctl kubeconfig command from the cluster.
Kubernetes pods might need to be restarted manually to pick up changes to the Kubernetes API CA.
Manual Steps for Kubernetes API CA Rotation
Steps are similar to the Talos API CA rotation, but use:
- .cluster.acceptedCAs in place of .machine.acceptedCAs;
- .cluster.ca in place of .machine.ca;
- kubeconfig in place of talosconfig.
5 - Customizing the Kernel
Talos Linux configures the kernel to allow loading only cryptographically signed modules. The signing key is generated during the build process, it is unique to each build, and it is not available to the user. The public key is embedded in the kernel, and it is used to verify the signature of the modules. So if you want to use a custom kernel module, you will need to build your own kernel, and all required kernel modules in order to get the signature in sync with the kernel.
Overview
In order to build a custom kernel (or a custom kernel module), the following steps are required:
- build a new Linux kernel and modules, push the artifacts to a registry
- build new Talos base artifacts: kernel and initramfs image
- produce a new Talos boot artifact (ISO, installer image, disk image, etc.)
We will go through each step in detail.
Building a Custom Kernel
First, you might need to prepare the build environment; follow the Building Custom Images guide.
Check out the siderolabs/pkgs repository:
git clone https://github.com/siderolabs/pkgs.git
cd pkgs
git checkout release-1.8
The kernel configuration is located in the kernel/build/config-ARCH files.
It can be modified using a text editor, or by using the Linux kernel menuconfig tool:
make kernel-menuconfig
The kernel configuration can be cleaned up by running:
make kernel-olddefconfig
Both commands will output the new configuration to the kernel/build/config-ARCH files.
Once ready, build the kernel and any out-of-tree modules (if required, e.g. zfs) and push the artifacts to a registry:
make kernel REGISTRY=127.0.0.1:5005 PUSH=true
By default, this command will compile and push the kernel for both amd64 and arm64 architectures, but you can specify a single architecture by overriding the PLATFORM variable:
make kernel REGISTRY=127.0.0.1:5005 PUSH=true PLATFORM=linux/amd64
This will create a container image 127.0.0.1:5005/siderolabs/kernel:$TAG with the kernel and modules.
Building Talos Base Artifacts
Follow the Building Custom Images guide to set up the Talos source code checkout.
If some new kernel modules were introduced, adjust the list of the default modules compiled into the Talos initramfs by editing the file hack/modules-ARCH.txt.
Try building base Talos artifacts:
make kernel initramfs PKG_KERNEL=127.0.0.1:5005/siderolabs/kernel:$TAG PLATFORM=linux/amd64
This should create a new image of the kernel and initramfs in _out/vmlinuz-amd64 and _out/initramfs-amd64.xz respectively.
Note: if building for arm64, replace amd64 with arm64 in the commands above.
As a final step, produce the new imager container image which can generate Talos boot assets:
make imager PKG_KERNEL=127.0.0.1:5005/siderolabs/kernel:$TAG PLATFORM=linux/amd64 INSTALLER_ARCH=targetarch
Note: if you built the kernel for both amd64 and arm64, a multi-arch imager container can be built as well by specifying INSTALLER_ARCH=all and PLATFORM=linux/amd64,linux/arm64.
Building Talos Boot Assets
Follow the Boot Assets guide to build the Talos boot assets you might need to boot Talos: ISO, installer image, etc.
Replace the reference to the imager in the guide with the reference to the imager container built above.
Note: if you update the imager container, don’t forget to docker pull it, as docker caches pulled images and won’t pull the updated image automatically.
6 - Developing Talos
This guide outlines steps and tricks to develop the Talos operating system and related components. The guide assumes a Linux operating system on the development host. Some steps might work under macOS, but using Linux is highly advised.
Prepare
Check out the Talos repository.
Try running make help to see available make commands.
You would need Docker and buildx installed on the host.
Note: Usually it is better to install an up-to-date Docker from the Docker apt repositories, e.g. Ubuntu instructions.
If the buildx plugin is not available with the OS docker packages, it can be installed as a plugin from GitHub releases.
Set up a builder with access to the host network:
docker buildx create --driver docker-container --driver-opt network=host --name local1 --buildkitd-flags '--allow-insecure-entitlement security.insecure' --use
Note: network=host allows the buildx builder to access the host network, so that it can push to a local container registry (see below).
Make sure the following steps work:
make talosctl
make initramfs kernel
Set up a local docker registry:
docker run -d -p 5005:5000 \
--restart always \
--name local registry:2
Try to build an installer image and push it to the local registry:
make installer IMAGE_REGISTRY=127.0.0.1:5005 PUSH=true
Record the image name output in the step above.
Note: it is also possible to force a stable image tag by using the TAG variable: make installer IMAGE_REGISTRY=127.0.0.1:5005 TAG=v1.0.0-alpha.1 PUSH=true.
Running Talos cluster
Set up local caching Docker registries (this speeds up Talos cluster boot a lot); the script is in the Talos repo:
bash hack/start-registry-proxies.sh
Start your local cluster with:
sudo --preserve-env=HOME _out/talosctl-linux-amd64 cluster create \
--provisioner=qemu \
--cidr=172.20.0.0/24 \
--registry-mirror docker.io=http://172.20.0.1:5000 \
--registry-mirror registry.k8s.io=http://172.20.0.1:5001 \
--registry-mirror gcr.io=http://172.20.0.1:5003 \
--registry-mirror ghcr.io=http://172.20.0.1:5004 \
--registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005 \
--install-image=127.0.0.1:5005/siderolabs/installer:<RECORDED HASH from the build step> \
--controlplanes 3 \
--workers 2 \
--with-bootloader=false
- --provisioner selects QEMU vs. default Docker
- custom --cidr to make the QEMU cluster use a different network than the default Docker setup (optional)
- --registry-mirror uses the caching proxies set up above to speed up boot time a lot; the last one adds your local registry (the installer image was pushed to it)
- --install-image is the image you built with make installer above
- --controlplanes & --workers configure cluster size, choose to match your resources; 3 controlplanes give you an HA control plane; 1 controlplane is enough, never do 2 controlplanes
- --with-bootloader=false disables boot from disk (Talos will always boot from _out/vmlinuz-amd64 and _out/initramfs-amd64.xz). This speeds up the development cycle a lot: there is no need to rebuild the installer and perform an install, rebooting is enough to get new code.
Note: as the boot loader is not used, it’s not necessary to rebuild the installer each time (the old image is fine), but sometimes it’s needed (when configuration changes are done and the old installer doesn’t validate the config).
talosctl cluster create derives the Talos machine configuration version from the install image tag, so sometimes early in the development cycle (when the new minor tag is not released yet), the machine config version can be overridden with --talos-version=v1.8.
If the --with-bootloader=false flag is not set, picking up new changes to the code (in initramfs) requires a Talos upgrade (so a new installer should be built).
With the --with-bootloader=false flag, Talos always boots from initramfs in the _out/ directory, so a simple reboot is enough to pick up new code changes.
If the installation flow needs to be tested, --with-bootloader=false shouldn’t be used.
Console Logs
Watching console logs is easy with tail:
tail -F ~/.talos/clusters/talos-default/talos-default-*.log
Interacting with Talos
Once talosctl cluster create finishes successfully, talosconfig and kubeconfig will be set up automatically to point to your cluster.
Start playing with talosctl:
talosctl -n 172.20.0.2 version
talosctl -n 172.20.0.3,172.20.0.4 dashboard
talosctl -n 172.20.0.4 get members
Same with kubectl:
kubectl get nodes -o wide
You can deploy some Kubernetes workloads to the cluster.
You can edit machine config on the fly with talosctl edit mc --immediate. Config patches can be applied via --config-patch flags, and many features have specific flags in talosctl cluster create.
Quick Reboot
To reboot the whole cluster quickly (e.g. to pick up a change made in the code):
for socket in ~/.talos/clusters/talos-default/talos-default-*.monitor; do echo "q" | sudo socat - unix-connect:$socket; done
Sending q to a single socket reboots a single node.
Note: This command performs an immediate reboot (as if the machine was powered down and immediately powered back up); for a normal Talos reboot, use talosctl reboot.
Development Cycle
Fast development cycle:
- bring up a cluster
- make code changes
- rebuild initramfs with make initramfs
- reboot a node to pick up the new initramfs
- verify code changes
- more code changes…
Some aspects of Talos development require the bootloader to be enabled (when working on the installer itself). In that case, the quick development cycle is no longer possible, and the cluster should be destroyed and recreated each time.
Running Integration Tests
If integration tests were changed (or when running them for the first time), first rebuild the integration test binary:
rm -f _out/integration-test-linux-amd64; make _out/integration-test-linux-amd64
Running short tests against QEMU provisioned cluster:
_out/integration-test-linux-amd64 \
-talos.provisioner=qemu \
-test.v \
-talos.crashdump=false \
-test.short \
-talos.talosctlpath=$PWD/_out/talosctl-linux-amd64
The whole test suite can be run by removing the -test.short flag.
Specific tests can be run with -test.run=TestIntegration/api.ResetSuite.
Build Flavors
make <something> WITH_RACE=1 enables the Go race detector; Talos runs slower and uses more memory, but data races are detected.
make <something> WITH_DEBUG=1 enables Go profiling and other debug features, useful for local development.
Destroying Cluster
sudo --preserve-env=HOME ../talos/_out/talosctl-linux-amd64 cluster destroy --provisioner=qemu
This command stops QEMU and helper processes, tears down the bridged network on the host, and cleans up cluster state in ~/.talos/clusters.
Note: if the host machine is rebooted, QEMU instances and helper processes won’t be started back. In that case, the files in the ~/.talos/clusters/<cluster-name> directory need to be cleaned up manually.
Optional
Set up cross-build environment with:
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
Note: the static qemu binaries which come with Ubuntu 21.10 seem to be broken.
Unit tests
Unit tests can be run in buildx with make unit-tests. On Ubuntu systems, some tests using loop devices will fail because Ubuntu uses low-index loop devices for snaps.
Most of the unit tests can be run standalone as well, with regular go test, or using IDE integration:
go test -v ./internal/pkg/circular/
This provides a much faster feedback loop, but some tests require either elevated privileges (running as root) or additional binaries available only in the Talos rootfs (containerd tests).
Running tests as root can be done with the -exec flag to go test, but this is risky, as test code has root access and can potentially make undesired changes:
go test -exec sudo -v ./internal/app/machined/pkg/controllers/network/...
Go Profiling
Build initramfs with debug enabled: make initramfs WITH_DEBUG=1.
Launch the Talos cluster with the bootloader disabled, and use go tool pprof to capture the profile and show the output in your browser:
go tool pprof http://172.20.0.2:9982/debug/pprof/heap
The IP address 172.20.0.2 is the address of the Talos node, and the port (:9982 in this example) depends on the Go application to profile:
- 9981: apid
- 9982: machined
- 9983: trustd
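The port table above can be wrapped in a small lookup so that the profile URL is built from the service name. This is a sketch: pprof_url is a hypothetical helper, and it assumes the /debug/pprof/<profile> path layout shown in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the pprof URL for a Talos service on a node.
# Arguments: node address, service name, profile name (default: heap).
pprof_url() {
  local node="$1" service="$2" profile="${3:-heap}" port
  case "$service" in
    apid)     port=9981 ;;
    machined) port=9982 ;;
    trustd)   port=9983 ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
  printf 'http://%s:%s/debug/pprof/%s\n' "$node" "$port" "$profile"
}

# Example: URL for the machined heap profile on the first node.
pprof_url 172.20.0.2 machined heap

# Intended use:
#   go tool pprof "$(pprof_url 172.20.0.2 apid)"
```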
Testing Air-gapped Environments
There is a hidden talosctl debug air-gapped command which launches two components:
- HTTP proxy capable of proxying HTTP and HTTPS requests
- HTTPS server with a self-signed certificate
The command also writes a Talos machine configuration patch to enable the HTTP proxy and add a self-signed certificate to the list of trusted certificates:
$ talosctl debug air-gapped --advertised-address 172.20.0.1
2022/08/04 16:43:14 writing config patch to air-gapped-patch.yaml
2022/08/04 16:43:14 starting HTTP proxy on :8002
2022/08/04 16:43:14 starting HTTPS server with self-signed cert on :8001
The --advertised-address should match the bridge IP of the Talos node.
Generated machine configuration patch looks like:
machine:
  files:
    - content: |
        -----BEGIN CERTIFICATE-----
        MIIBijCCAS+gAwIBAgIBATAKBggqhkjOPQQDAjAUMRIwEAYDVQQKEwlUZXN0IE9u
        bHkwHhcNMjIwODA0MTI0MzE0WhcNMjIwODA1MTI0MzE0WjAUMRIwEAYDVQQKEwlU
        ZXN0IE9ubHkwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAAQfOJdaOFSOI1I+EeP1
        RlMpsDZJaXjFdoo5zYM5VYs3UkLyTAXAmdTi7JodydgLhty0pwLEWG4NUQAEvip6
        EmzTo3IwcDAOBgNVHQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsG
        AQUFBwMCMA8GA1UdEwEB/wQFMAMBAf8wHQYDVR0OBBYEFCwxL+BjG0pDwaH8QgKW
        Ex0J2mVXMA8GA1UdEQQIMAaHBKwUAAEwCgYIKoZIzj0EAwIDSQAwRgIhAJoW0z0D
        JwpjFcgCmj4zT1SbBFhRBUX64PHJpAE8J+LgAiEAvfozZG8Or6hL21+Xuf1x9oh4
        /4Hx3jozbSjgDyHOLk4=
        -----END CERTIFICATE-----
      permissions: 0o644
      path: /etc/ssl/certs/ca-certificates
      op: append
  env:
    http_proxy: http://172.20.0.1:8002
    https_proxy: http://172.20.0.1:8002
    no_proxy: 172.20.0.1/24
cluster:
  extraManifests:
    - https://172.20.0.1:8001/debug.yaml
The first section appends a self-signed certificate of the HTTPS server to the list of trusted certificates, followed by the HTTP proxy setup (in-cluster traffic is excluded from the proxy). The last section adds an extra Kubernetes manifest hosted on the HTTPS server.
The machine configuration patch can now be used to launch a test Talos cluster:
talosctl cluster create ... --config-patch @air-gapped-patch.yaml
The following lines should appear in the output of the talosctl debug air-gapped command:
- CONNECT discovery.talos.dev:443: the HTTP proxy is used to talk to the discovery service
- http: TLS handshake error from 172.20.0.2:53512: remote error: tls: bad certificate: an expected error on the Talos side, as the self-signed cert is not written to the file yet
- GET /debug.yaml: Talos successfully fetches the extra manifest
There might be more output depending on whether registry caches are being used.
Running Upgrade Integration Tests
Talos has a separate set of provision upgrade tests, which create a cluster on older versions of Talos, perform an upgrade, and verify that the cluster is still functional.
Build the test binary:
rm -f _out/integration-test-provision-linux-amd64; make _out/integration-test-provision-linux-amd64
Prepare the test artifacts for the upgrade test:
make release-artifacts
Build and push an installer image for the development version of Talos:
make installer IMAGE_REGISTRY=127.0.0.1:5005 PUSH=true
Run the tests (the tests will create the cluster on the older version of Talos, perform an upgrade, and verify that the cluster is still functional):
sudo --preserve-env=HOME _out/integration-test-provision-linux-amd64 \
-test.v \
-talos.talosctlpath _out/talosctl-linux-amd64 \
-talos.provision.target-installer-registry=127.0.0.1:5005 \
-talos.provision.registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005,docker.io=http://172.20.0.1:5000,registry.k8s.io=http://172.20.0.1:5001,quay.io=http://172.20.0.1:5002,gcr.io=http://172.20.0.1:5003,ghcr.io=http://172.20.0.1:5004 \
-talos.provision.cidr 172.20.0.0/24
7 - Disaster Recovery
The etcd database backs Kubernetes control plane state, so if the etcd service is unavailable, the Kubernetes control plane goes down, and the cluster is not recoverable until etcd is recovered.
etcd is built around the Raft consensus protocol, so highly available control plane clusters can tolerate the loss of nodes as long as more than half of the members are running and reachable.
For a three control plane node Talos cluster, this means that the cluster tolerates a failure of any single node, but losing more than one node at the same time leads to complete loss of service.
Because of that, it is important to take routine backups of etcd state to have a snapshot to recover the cluster from in case of catastrophic failure.
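The "more than half of the members" rule above can be worked through with a short Python sketch. The function names are illustrative only; the arithmetic is the standard majority calculation and is not taken from Talos or etcd source code.

```python
def quorum(members):
    """Smallest number of members that is more than half of the cluster."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

For three control plane nodes this gives a quorum of two and one tolerated failure, which matches the statement above; a single control plane node tolerates none.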
Backup
Snapshotting etcd Database
Create a consistent snapshot of the etcd database with the talosctl etcd snapshot command:
$ talosctl -n <IP> etcd snapshot db.snapshot
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136
Note: the filename db.snapshot is arbitrary.
This database snapshot can be taken on any healthy control plane node (with IP address <IP> in the example above), as all etcd instances contain exactly the same data.
It is recommended to configure etcd snapshots to be created on a schedule to allow point-in-time recovery using the latest snapshot.
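One possible way to schedule snapshots is a small wrapper around the talosctl etcd snapshot command shown above, run from cron or a similar scheduler. The sketch below is only an assumption about how such a wrapper might look: the function names, the backups directory, and the timestamped file naming are hypothetical, and only the talosctl invocation itself comes from this guide.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_command(node_ip, backup_dir, now):
    """Build the `talosctl etcd snapshot` invocation with a timestamped file name."""
    name = f"etcd-{now.strftime('%Y%m%dT%H%M%SZ')}.snapshot"
    return ["talosctl", "-n", node_ip, "etcd", "snapshot", str(Path(backup_dir) / name)]


def take_snapshot(node_ip, backup_dir):
    """Take one snapshot; meant to be called from a scheduler such as cron."""
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    cmd = snapshot_command(node_ip, backup_dir, datetime.now(timezone.utc))
    subprocess.run(cmd, check=True)  # raises CalledProcessError if talosctl fails


# Show the command that would be run; 172.20.0.2 and "backups" are placeholders.
print(snapshot_command("172.20.0.2", "backups", datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)))
```

Timestamped file names keep older snapshots around, so a retention policy (not shown here) would be needed to prune them.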
Disaster Database Snapshot
If the etcd cluster is not healthy (for example, if quorum has already been lost), the talosctl etcd snapshot command might fail.
In that case, copy the database snapshot directly from the control plane node:
talosctl -n <IP> cp /var/lib/etcd/member/snap/db .
This snapshot might not be fully consistent (if the etcd process is running), but it allows for disaster recovery when the latest regular snapshot is not available.
Machine Configuration
Machine configuration might be required to recover the node after hardware failure. Back up the Talos node machine configuration with the command:
talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -
Recovery
Before starting a disaster recovery procedure, make sure that the etcd cluster can’t be recovered:
- get the etcd cluster member list on all healthy control plane nodes with the talosctl -n <IP> etcd members command and compare across all members.
- query etcd health across control plane nodes with talosctl -n <IP> service etcd.
If the quorum can be restored, restoring quorum might be a better strategy than performing a full disaster recovery procedure.
Latest Etcd Snapshot
Get hold of the latest etcd database snapshot.
If a snapshot is not fresh enough, create a database snapshot (see above), even if the etcd cluster is unhealthy.
Init Node
Make sure that there are no control plane nodes with machine type init:
$ talosctl -n <IP1>,<IP2>,... get machinetype
NODE NAMESPACE TYPE ID VERSION TYPE
172.20.0.2 config MachineType machine-type 2 controlplane
172.20.0.4 config MachineType machine-type 2 controlplane
172.20.0.3 config MachineType machine-type 2 controlplane
The init node type is deprecated and is incompatible with the etcd recovery procedure.
An init node can be converted to the controlplane type with the talosctl edit mc --mode=staged command, followed by a node reboot with the talosctl reboot command.
Preparing Control Plane Nodes
If some control plane nodes experienced hardware failure, replace them with new nodes.
Use the machine configuration backup to re-create the nodes with the same secret material and control plane settings to allow workers to join the recovered control plane.
If a control plane node is up but etcd isn’t, wipe the node’s EPHEMERAL partition to remove the etcd data directory (make sure a database snapshot is taken before doing this):
talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
At this point, all control plane nodes should boot up, and the etcd service should be in the Preparing state.
If the node addresses have changed, point the Kubernetes control plane endpoint to the new control plane nodes.
Recovering from the Backup
Make sure all etcd service instances are in the Preparing state:
$ talosctl -n <IP> service etcd
NODE 172.20.0.2
ID etcd
STATE Preparing
HEALTH ?
EVENTS [Preparing]: Running pre state (17s ago)
[Waiting]: Waiting for service "cri" to be "up", time sync (18s ago)
[Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", time sync (20s ago)
Execute the bootstrap command against any control plane node, passing the path to the etcd database snapshot:
$ talosctl -n <IP> bootstrap --recover-from=./db.snapshot
recovering from snapshot "./db.snapshot": hash c25fd181, revision 4193, total keys 1287, total size 3035136
Note: if the database snapshot was copied out directly from the etcd data directory using talosctl cp, add the flag --recover-skip-hash-check to skip the integrity check on restore.
Talos node should print matching information in the kernel log:
recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restoring snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/li}
{"level":"info","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":3360}
{"level":"info","msg":"added member","cluster-id":"a3390e43eb5274e2","local-member-id":"0","added-peer-id":"eb4f6f534361855e","added-peer-peer-urls":["https:/}
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
Now the etcd service should become healthy on the bootstrap node, the Kubernetes control plane components should start, and the control plane endpoint should become available.
The remaining control plane nodes join the etcd cluster once the control plane endpoint is up.
Single Control Plane Node Cluster
This guide applies to single control plane node clusters as well.
In fact, it is much more important to take regular snapshots of the etcd database in the single control plane node case, as loss of the control plane node might render the whole cluster irrecoverable without a backup.
8 - Egress Domains
For some more constrained environments, it is important to whitelist only specific domains for outbound internet access. These rules will need to be updated to allow for certain domains if the user wishes to still install and bootstrap Talos from public sources. That said, users should also note that all of the following components can be mirrored locally with an internal registry, as well as a self-hosted discovery service and image factory.
The following list of egress domains was tested using a Fortinet FortiGate Next-Generation Firewall to confirm that Talos was installed, bootstrapped, and Kubernetes was fully up and running. The FortiGate allows for passing in wildcard domains and will handle resolution of those domains to defined IPs automatically. All traffic is HTTPS over port 443.
Discovery Service:
- discovery.talos.dev
Image Factory:
- factory.talos.dev
- *.azurefd.net (Azure Front Door for serving cached assets)
Google Container Registry / Google Artifact Registry (GCR/GAR):
- gcr.io
- storage.googleapis.com (backing blob storage for images)
- *.pkg.dev (backing blob storage for images)
GitHub Container Registry (GHCR):
- ghcr.io
- *.githubusercontent.com (backing blob storage for images)
Kubernetes Registry (k8s.io):
- registry.k8s.io
- *.s3.dualstack.us-east-1.amazonaws.com (backing blob storage for images)
Note: In this testing, DNS and NTP servers were updated to use the services built into the FortiGate. These may also need to be allowed if the user cannot make use of internal services. Additionally, these rules only cover what is required for Talos to be fully installed and running. There may be other domains, like docker.io, that must be allowed for non-default CNIs or workload container images.
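To check a hostname against the domain list above before changing firewall rules, a quick Python sketch such as the following can help. It is only an approximation: the helper name is_allowed is hypothetical, and shell-style wildcard matching from the standard library is assumed here, which may differ from how a particular firewall resolves wildcard domains.

```python
from fnmatch import fnmatch

# Domains copied from the list above.
ALLOWED = [
    "discovery.talos.dev",
    "factory.talos.dev",
    "*.azurefd.net",
    "gcr.io",
    "storage.googleapis.com",
    "*.pkg.dev",
    "ghcr.io",
    "*.githubusercontent.com",
    "registry.k8s.io",
    "*.s3.dualstack.us-east-1.amazonaws.com",
]


def is_allowed(host):
    """Return True if the host matches any entry of the allowlist."""
    host = host.lower().rstrip(".")
    return any(fnmatch(host, pattern) for pattern in ALLOWED)


for host in ("ghcr.io", "pkg-containers.githubusercontent.com", "docker.io"):
    print(host, is_allowed(host))
```

In this sketch docker.io is rejected, which mirrors the note above that extra domains may need to be added for workload images.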
9 - etcd Maintenance
The etcd database backs Kubernetes control plane state, so etcd health is critical for Kubernetes availability.
Space Quota
The etcd database space quota is set to 2 GiB by default.
If the database size exceeds the quota, etcd will stop operations until the issue is resolved.
This condition can be checked with the talosctl etcd alarm list command:
$ talosctl -n <IP> etcd alarm list
NODE MEMBER ALARM
172.20.0.2 a49c021e76e707db NOSPACE
If the Kubernetes database contains lots of resources, the space quota can be increased to match the actual usage. The recommended maximum size is 8 GiB.
To increase the space quota, edit the etcd section in the machine configuration:
cluster:
  etcd:
    extraArgs:
      quota-backend-bytes: 4294967296 # 4 GiB
Once the node is rebooted with the new configuration, use talosctl etcd alarm disarm to clear the NOSPACE alarm.
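The quota-backend-bytes value is expressed in bytes, so a GiB figure has to be converted first. A minimal Python sketch of that arithmetic follows; the helper name quota_bytes is hypothetical.

```python
GIB = 1024 ** 3  # bytes in one GiB


def quota_bytes(gib):
    """Convert a quota given in GiB to the byte value used by quota-backend-bytes."""
    return gib * GIB


for size in (2, 4, 8):
    print(f"{size} GiB = {quota_bytes(size)} bytes")
```

The 4 GiB case yields 4294967296, the value used in the configuration above.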
Defragmentation
The etcd database can become fragmented over time if there are lots of writes and deletes.
The Kubernetes API server performs automatic compaction of the etcd database, which marks deleted space as free and ready to be reused.
However, the space is not actually freed until the database is defragmented.
If the database is heavily fragmented (the in use/db size ratio is less than 0.5), defragmentation might increase performance. If the database runs over the space quota (see above), but the actual in-use database size is small, defragmentation is required to bring the on-disk database size below the limit.
The current database size can be checked with the talosctl etcd status command:
$ talosctl -n <CP1>,<CP2>,<CP3> etcd status
NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
172.20.0.3 ecebb05b59a776f1 21 MB 6.0 MB (29.08%) ecebb05b59a776f1 53391 4 53391 false
172.20.0.2 a49c021e76e707db 17 MB 4.5 MB (26.10%) ecebb05b59a776f1 53391 4 53391 false
172.20.0.4 eb47fb33e59bf0e2 20 MB 5.9 MB (28.96%) ecebb05b59a776f1 53391 4 53391 false
If any of the nodes are over the database size quota, alarms will be printed in the ERRORS column.
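The 0.5 threshold mentioned above can be applied to the DB SIZE and IN USE columns with a few lines of Python. This is only a sketch: the function names are hypothetical, and the sizes are typed in by hand from the status output rather than parsed from talosctl.

```python
def in_use_ratio(db_size_mb, in_use_mb):
    """Ratio of in-use data to on-disk database size."""
    if db_size_mb <= 0:
        raise ValueError("database size must be positive")
    return in_use_mb / db_size_mb


def should_defragment(db_size_mb, in_use_mb, threshold=0.5):
    """True when the database is heavily fragmented per the threshold above."""
    return in_use_ratio(db_size_mb, in_use_mb) < threshold


# Sizes in MB as shown for node 172.20.0.3 in the status output above.
print(should_defragment(21, 6.0))
```

A member whose database size already matches its in-use size has a ratio of 1.0 and would not be flagged.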
To defragment the database, run the talosctl etcd defrag command:
talosctl -n <CP1> etcd defrag
Note: defragmentation is a resource-intensive operation, so it is recommended to run it on a single node at a time. Defragmenting a live member blocks the system from reading and writing data while the member rebuilds its state.
Once the defragmentation is complete, the database size will closely match the in-use size:
$ talosctl -n <CP1> etcd status
NODE MEMBER DB SIZE IN USE LEADER RAFT INDEX RAFT TERM RAFT APPLIED INDEX LEARNER ERRORS
172.20.0.2 a49c021e76e707db 4.5 MB 4.5 MB (100.00%) ecebb05b59a776f1 56065 4 56065 false
Snapshotting
Regular backups of the etcd database should be performed to ensure that the cluster can be restored in case of a failure.
This procedure is described in the disaster recovery guide.
10 - Extension Services
Talos provides a way to run additional system services early in the Talos boot process. Extension services should be included in the Talos root filesystem (e.g. using system extensions). Extension services run as privileged containers with an ephemeral root filesystem located in the Talos root filesystem.
Extension services can be used to extend core features of Talos in a way that is not possible via static pods or Kubernetes DaemonSets.
Potential extension services use-cases:
- storage: Open iSCSI, software RAID, etc.
- networking: BGP FRR, etc.
- platform integration: VMware open VM tools, etc.
Configuration
On boot, Talos scans the directory /usr/local/etc/containers for *.yaml files describing the extension services to run.
Format of the extension service config:
name: hello-world
container:
  entrypoint: ./hello-world
  environment:
    - XDG_RUNTIME_DIR=/run
  args:
    - -f
  mounts:
    - # OCI Mount Spec
depends:
  - configuration: true
  - service: cri
  - path: /run/machined/machined.sock
  - network:
      - addresses
      - connectivity
      - hostname
      - etcfiles
  - time: true
restart: never|always|untilSuccess
logToConsole: true|false
name
The name field sets the service name; valid names are [a-z0-9-_]+.
The service container root filesystem path is derived from the name: /usr/local/lib/containers/<name>.
The extension service will be registered as a Talos service under an ext-<name> identifier.
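The naming rules above can be checked ahead of time with a short Python sketch. The helper name service_paths is hypothetical, and the regular expression is written as [a-z0-9_-]+ (the same character set, with the hyphen moved to the end) so that it parses unambiguously; Talos itself may validate names differently.

```python
import re

# Same character set as [a-z0-9-_]+ above, with the hyphen placed last.
NAME_RE = re.compile(r"[a-z0-9_-]+")


def service_paths(name):
    """Validate an extension service name and derive its rootfs path and service ID."""
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid extension service name: {name!r}")
    return {
        "rootfs": f"/usr/local/lib/containers/{name}",
        "service_id": f"ext-{name}",
    }


print(service_paths("hello-world"))
```

Names containing uppercase letters, spaces, or dots raise ValueError in this sketch.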
container
- entrypoint defines the container entrypoint relative to the container root filesystem (/usr/local/lib/containers/<name>)
- environmentFile (deprecated) defines the path to a file containing environment variables; the service waits for the file to exist before starting. Use ExtensionServiceConfig instead.
- environment defines the container environment variables.
- args defines the additional arguments to pass to the entrypoint
- mounts defines the volumes to be mounted into the container root
container.mounts
The mounts section uses the standard OCI spec:
- source: /var/log/audit
  destination: /var/log/audit
  type: bind
  options:
    - rshared
    - bind
    - ro
All requested directories will be mounted into the extension service container mount namespace.
If the source directory doesn’t exist in the host filesystem, it will be created (only for writable paths in the Talos root filesystem).
container.security
The security section follows this example:
maskedPaths:
  - "/should/be/masked"
readonlyPaths:
  - "/path/that/should/be/readonly"
  - "/another/readonly/path"
writeableRootfs: true
writeableSysfs: true
rootfsPropagation: shared
- The rootfs is readonly by default unless writeableRootfs: true is set.
- The sysfs is readonly by default unless writeableSysfs: true is set.
- Masked paths, if not set, default to the containerd defaults. Masked paths will be mounted to /dev/null. To set empty masked paths use:
  container:
    security:
      maskedPaths: []
- Read-only paths, if not set, default to the containerd defaults. Read-only paths will be set as read-only inside the container. To set empty read-only paths use:
  container:
    security:
      readonlyPaths: []
- Rootfs propagation is not set by default (container mounts are private).
depends
The depends section describes extension service start dependencies: the service will not be started until all dependencies are met.
Available dependencies:
- service: <name>: wait for the service <name> to be running and healthy
- path: <path>: wait for the <path> to exist
- network: [addresses, connectivity, hostname, etcfiles]: wait for the specified network readiness checks to succeed
- time: true: wait for the NTP time sync
- configuration: true: wait for an ExtensionServiceConfig resource with a name matching the extension name to be available. The mounts specified in the ExtensionServiceConfig will be added as extra mounts to the extension service.
restart
The restart field defines the service restart policy; it allows configuring either an always-running service or a one-shot service:
- always: restart service always
- never: start service only once and never restart
- untilSuccess: restart failing service, stop restarting on successful run
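The three restart policies can be summarized as a decision function. The Python sketch below only illustrates the semantics described above; the function name should_restart is hypothetical and this is not how Talos implements its service runner.

```python
def should_restart(policy, exited_successfully):
    """Decide whether a stopped service is started again under the given policy."""
    if policy == "always":
        return True
    if policy == "never":
        return False
    if policy == "untilSuccess":
        # Keep restarting only while the service keeps failing.
        return not exited_successfully
    raise ValueError(f"unknown restart policy: {policy!r}")


for policy in ("always", "never", "untilSuccess"):
    print(policy, should_restart(policy, True), should_restart(policy, False))
```

A one-shot setup task would typically use untilSuccess, while a long-running daemon would use always.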
logToConsole
The logToConsole field defines whether the service logs should also be written to the console, i.e., to the kernel log buffer (or to the container logs in container mode).
This feature is particularly useful for debugging extensions that operate in maintenance mode or early in the boot process when service logs cannot be accessed yet.
Example
Example layout of the Talos root filesystem contents for the extension service:
/
└── usr
    └── local
        ├── etc
        │   └── containers
        │       └── hello-world.yaml
        └── lib
            └── containers
                └── hello-world
                    ├── hello
                    └── config.ini
Talos discovers the extension service configuration in /usr/local/etc/containers/hello-world.yaml:
name: hello-world
container:
  entrypoint: ./hello
  args:
    - --config
    - config.ini
depends:
  - network:
      - addresses
restart: always
Talos starts the container for the extension service with the container root filesystem at /usr/local/lib/containers/hello-world:
/
├── hello
└── config.ini
The extension service is registered as ext-hello-world in talosctl services:
$ talosctl service ext-hello-world
NODE 172.20.0.5
ID ext-hello-world
STATE Running
HEALTH ?
EVENTS [Running]: Started task ext-hello-world (PID 1100) for container ext-hello-world (2m47s ago)
[Preparing]: Creating service runner (2m47s ago)
[Preparing]: Running pre state (2m47s ago)
[Waiting]: Waiting for service "containerd" to be "up" (2m48s ago)
[Waiting]: Waiting for service "containerd" to be "up", network (2m49s ago)
An extension service can be started, restarted and stopped using talosctl service ext-hello-world start|restart|stop.
Use talosctl logs ext-hello-world to get the logs of the service.
A complete example of the extension service can be found in the extensions repository.
11 - Machine Configuration OAuth2 Authentication
How to authenticate the machine configuration download (talos.config=) on the metal platform using OAuth.
Talos Linux, when running on the metal platform, can be configured to authenticate the machine configuration download using the OAuth2 device flow.
The machine configuration is fetched from the URL specified with the talos.config kernel argument, and by default this HTTP request is not authenticated.
When the OAuth2 authentication is enabled, Talos will authenticate the request using OAuth device flow first, and then pass the token to the machine configuration download endpoint.
Prerequisites
Obtain the following information:
- OAuth client ID (mandatory)
- OAuth client secret (optional)
- OAuth device endpoint
- OAuth token endpoint
- OAuth scopes, audience (optional)
- extra Talos variables to send to the device auth endpoint (optional)
Configuration
Set the following kernel parameters on the initial Talos boot to enable the OAuth flow:
- talos.config set to the URL of the machine configuration endpoint (which will be authenticated using OAuth)
- talos.config.oauth.client_id set to the OAuth client ID (required)
- talos.config.oauth.client_secret set to the OAuth client secret (optional)
- talos.config.oauth.scope set to the OAuth scopes (optional, repeat the parameter for multiple scopes)
- talos.config.oauth.audience set to the OAuth audience (optional)
- talos.config.oauth.device_auth_url set to the OAuth device endpoint (if not set, defaults to the talos.config URL with the path /device/code)
- talos.config.oauth.token_url set to the OAuth token endpoint (if not set, defaults to the talos.config URL with the path /token)
- talos.config.oauth.extra_variable set to the extra Talos variables to send to the device auth endpoint (optional, repeat the parameter for multiple variables)
The list of variables supported by the talos.config.oauth.extra_variable parameter is the same as the list of variables supported by the talos.config parameter.
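Because scopes and extra variables are passed by repeating a kernel parameter, assembling the command line by hand is error-prone. The Python sketch below shows one way to generate it; the function name oauth_kernel_args, the example URL, and the client ID are hypothetical, while the parameter names are the ones listed above.

```python
def oauth_kernel_args(config_url, client_id, client_secret=None, scopes=(),
                      audience=None, device_auth_url=None, token_url=None,
                      extra_variables=()):
    """Return the list of kernel arguments that enable the OAuth device flow."""
    args = [f"talos.config={config_url}",
            f"talos.config.oauth.client_id={client_id}"]
    if client_secret:
        args.append(f"talos.config.oauth.client_secret={client_secret}")
    # Repeated parameters: one entry per scope.
    args += [f"talos.config.oauth.scope={scope}" for scope in scopes]
    if audience:
        args.append(f"talos.config.oauth.audience={audience}")
    if device_auth_url:
        args.append(f"talos.config.oauth.device_auth_url={device_auth_url}")
    if token_url:
        args.append(f"talos.config.oauth.token_url={token_url}")
    # Repeated parameters: one entry per extra variable.
    args += [f"talos.config.oauth.extra_variable={var}" for var in extra_variables]
    return args


# Placeholder values for illustration only.
print(" ".join(oauth_kernel_args(
    "https://example.com/config.yaml", "my-client-id",
    scopes=["openid", "email"], extra_variables=["uuid", "mac"])))
```

Optional parameters that are left unset are simply omitted, so Talos falls back to the defaults described above.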
Flow
On the initial Talos boot, when machine configuration is not available, Talos will print the following messages:
[talos] downloading config {"component": "controller-runtime", "controller": "config.AcquireController", "platform": "metal"}
[talos] waiting for network to be ready
[talos] [OAuth] starting the authentication device flow with the following settings:
[talos] [OAuth] - client ID: "<REDACTED>"
[talos] [OAuth] - device auth URL: "https://oauth2.googleapis.com/device/code"
[talos] [OAuth] - token URL: "https://oauth2.googleapis.com/token"
[talos] [OAuth] - extra variables: ["uuid" "mac"]
[talos] waiting for variables: [uuid mac]
[talos] waiting for variables: [mac]
[talos] [OAuth] please visit the URL https://www.google.com/device and enter the code <REDACTED>
[talos] [OAuth] waiting for the device to be authorized (expires at 14:46:55)...
If the OAuth service provides the complete verification URL, the QR code to scan is also printed to the console:
[talos] [OAuth] or scan the following QR code:
█████████████████████████████████
█████████████████████████████████
████ ▄▄▄▄▄ ██▄▀▀ ▀█ ▄▄▄▄▄ ████
████ █ █ █▄ ▀▄██▄██ █ █ ████
████ █▄▄▄█ ██▀▄██▄ ▀█ █▄▄▄█ ████
████▄▄▄▄▄▄▄█ ▀ █ ▀ █▄█▄▄▄▄▄▄▄████
████ ▀ ▄▄ ▄█ ██▄█ ███▄█▀████
████▀█▄ ▄▄▀▄▄█▀█▄██ ▄▀▄██▄ ▄████
████▄██▀█▄▄▄███▀ ▀█▄▄ ██ █▄ ████
████▄▀▄▄▄ ▄███ ▄ ▀ ▀▀▄▀▄▀█▄ ▄████
████▄█████▄█ █ ██ ▀ ▄▄▄ █▀▀████
████ ▄▄▄▄▄ █ █ ▀█▄█▄ █▄█ █▄ ████
████ █ █ █▄ ▄▀ ▀█▀▄▄▄ ▀█▄████
████ █▄▄▄█ █ ██▄ ▀ ▀███ ▀█▀▄████
████▄▄▄▄▄▄▄█▄▄█▄██▄▄▄▄█▄███▄▄████
█████████████████████████████████
Once the authentication flow is complete on the OAuth provider side, Talos will print the following message:
[talos] [OAuth] device authorized
[talos] fetching machine config from: "http://example.com/config.yaml"
[talos] machine config loaded successfully {"component": "controller-runtime", "controller": "config.AcquireController", "sources": ["metal"]}
12 - Metal Network Configuration
META-based network configuration on Talos metal platform.
Note: This is an advanced feature which requires a deep understanding of Talos and Linux network configuration.
Talos Linux, when running on a cloud platform (e.g. AWS or Azure), uses the platform-provided metadata server to provide initial network configuration to the node. When running on bare-metal, there is no metadata server, so there are several options to provide initial network configuration (before machine configuration is acquired):
- use automatic network configuration via DHCP (Talos default)
- use initial boot kernel command line parameters to configure networking
- use automatic network configuration via DHCP just enough to fetch machine configuration and then use machine configuration to set desired advanced configuration.
If the DHCP option is available, it is by far the easiest way to configure networking. The initial boot kernel command line parameters are not very flexible, and they are not persisted after the initial Talos installation.
Starting with version 1.4.0, Talos offers a new option to configure networking on bare-metal: META-based network configuration.
Note: META-based network configuration is only available on the Talos Linux metal platform.
The Talos dashboard provides a way to configure META-based network configuration for a machine using the console, but it doesn’t support all kinds of network configuration.
Network Configuration Format
Talos META-based network configuration is a YAML file with the following format:
addresses:
  - address: 147.75.61.43/31
    linkName: bond0
    family: inet4
    scope: global
    flags: permanent
    layer: platform
  - address: 2604:1380:45f2:6c00::1/127
    linkName: bond0
    family: inet6
    scope: global
    flags: permanent
    layer: platform
  - address: 10.68.182.1/31
    linkName: bond0
    family: inet4
    scope: global
    flags: permanent
    layer: platform
links:
  - name: eth0
    up: true
    masterName: bond0
    slaveIndex: 0
    layer: platform
  - name: eth1
    up: true
    masterName: bond0
    slaveIndex: 1
    layer: platform
  - name: bond0
    logical: true
    up: true
    mtu: 0
    kind: bond
    type: ether
    bondMaster:
      mode: 802.3ad
      xmitHashPolicy: layer3+4
      lacpRate: slow
      arpValidate: none
      arpAllTargets: any
      primaryReselect: always
      failOverMac: 0
      miimon: 100
      updelay: 200
      downdelay: 200
      resendIgmp: 1
      lpInterval: 1
      packetsPerSlave: 1
      numPeerNotif: 1
      tlbLogicalLb: 1
      adActorSysPrio: 65535
    layer: platform
routes:
  - family: inet4
    gateway: 147.75.61.42
    outLinkName: bond0
    table: main
    priority: 1024
    scope: global
    type: unicast
    protocol: static
    layer: platform
  - family: inet6
    gateway: '2604:1380:45f2:6c00::'
    outLinkName: bond0
    table: main
    priority: 2048
    scope: global
    type: unicast
    protocol: static
    layer: platform
  - family: inet4
    dst: 10.0.0.0/8
    gateway: 10.68.182.0
    outLinkName: bond0
    table: main
    scope: global
    type: unicast
    protocol: static
    layer: platform
hostnames:
  - hostname: ci-blue-worker-amd64-2
    layer: platform
resolvers: []
timeServers: []
Every section is optional, so you can configure only the parts you need.
The format of each section matches the .spec part of the respective network *Spec resource, e.g. the addresses: section matches the .spec of the AddressSpec resource:
# talosctl get addressspecs bond0/10.68.182.1/31 -o yaml | yq .spec
address: 10.68.182.1/31
linkName: bond0
family: inet4
scope: global
flags: permanent
layer: platform
So one way to prepare the network configuration file is to boot Talos Linux, apply necessary network configuration using Talos machine configuration, and grab the resulting resources from the running Talos instance.
In this guide we will briefly cover the most common examples of the network configuration.
Addresses
The addresses configured are usually routable IP addresses assigned to the machine, so the scope: should be set to global and flags: to permanent.
Additionally, family: should be set to either inet4 or inet6 depending on the address family.
The linkName: property should match the name of the link the address is assigned to. It might be a physical link, e.g. en9sp0, or the name of a logical link, e.g. bond0, created in the links: section.
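Picking the right family: value by hand is easy to get wrong, so it can be derived from the address itself. The Python sketch below does this with the standard ipaddress module; the helper name address_entry is hypothetical, and the fixed scope, flags, and layer values are the ones recommended in this section.

```python
import ipaddress


def address_entry(cidr, link_name):
    """Build one entry for the addresses: section, deriving family from the address."""
    interface = ipaddress.ip_interface(cidr)  # raises ValueError on malformed input
    family = "inet4" if interface.version == 4 else "inet6"
    return {
        "address": cidr,
        "linkName": link_name,
        "family": family,
        "scope": "global",
        "flags": "permanent",
        "layer": "platform",
    }


print(address_entry("147.75.61.43/31", "bond0"))
print(address_entry("2604:1380:45f2:6c00::1/127", "bond0"))
```

The resulting dictionaries map one-to-one onto the YAML entries used in this section.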
Example, IPv4 address:
addresses:
  - address: 147.75.61.43/31
    linkName: bond0
    family: inet4
    scope: global
    flags: permanent
    layer: platform
Example, IPv6 address:
addresses:
  - address: 2604:1380:45f2:6c00::1/127
    linkName: bond0
    family: inet6
    scope: global
    flags: permanent
    layer: platform
Links
For physical network interfaces (links), the most usual configuration is to bring the link up:
links:
  - name: en9sp0
    up: true
    layer: platform
This will bring the link up, and it will also disable Talos auto-configuration (disables running DHCP on the link).
Another common case is to set a custom MTU:
links:
  - name: en9sp0
    up: true
    mtu: 9000
    layer: platform
The order of the links in the links: section is not important.
Bonds
For bonded links, there should be a link resource for the bond itself, and a link resource for each enslaved link:
links:
  - name: bond0
    logical: true
    up: true
    kind: bond
    type: ether
    bondMaster:
      mode: 802.3ad
      xmitHashPolicy: layer3+4
      lacpRate: slow
      arpValidate: none
      arpAllTargets: any
      primaryReselect: always
      failOverMac: 0
      miimon: 100
      updelay: 200
      downdelay: 200
      resendIgmp: 1
      lpInterval: 1
      packetsPerSlave: 1
      numPeerNotif: 1
      tlbLogicalLb: 1
      adActorSysPrio: 65535
    layer: platform
  - name: eth0
    up: true
    masterName: bond0
    slaveIndex: 0
    layer: platform
  - name: eth1
    up: true
    masterName: bond0
    slaveIndex: 1
    layer: platform
The name of the bond can be anything supported by the Linux kernel, but the following properties are important:
- logical: true - this is a logical link, not a physical one
- kind: bond - this is a bonded link
- type: ether - this is an Ethernet link
- bondMaster: - defines bond configuration, please see Linux documentation on the available options
For each enslaved link, the following properties are important:
- masterName: bond0 - the name of the bond this link is enslaved to
- slaveIndex: 0 - the index of the enslaved link, starting from 0, controls the order of bond slaves
VLANs
VLANs are logical links which have a parent link, and a VLAN ID and protocol:
links:
  - name: bond0.35
    logical: true
    up: true
    kind: vlan
    type: ether
    parentName: bond0
    vlan:
      vlanID: 35
      vlanProtocol: 802.1ad
The name of the VLAN link can be anything supported by the Linux kernel, but the following properties are important:
- logical: true - this is a logical link, not a physical one
- kind: vlan - this is a VLAN link
- type: ether - this is an Ethernet link
- parentName: bond0 - the name of the parent link
- vlan: - defines VLAN configuration: vlanID and vlanProtocol
Routes
For route configuration, most of the time table: main, scope: global, type: unicast and protocol: static are used.
The most important route fields are:
- dst: defines the destination network, if left empty means “default gateway”
- gateway: defines the gateway address
- priority: defines the route priority (metric), lower values are preferred for the same dst: network
- outLinkName: defines the name of the link the route is associated with
- src: sets the source address for the route (optional)
Additionally, family: should be set to either inet4 or inet6 depending on the address family.
Example, IPv6 default gateway:
routes:
  - family: inet6
    gateway: '2604:1380:45f2:6c00::'
    outLinkName: bond0
    table: main
    priority: 2048
    scope: global
    type: unicast
    protocol: static
    layer: platform
Example, IPv4 route to 10/8 via 10.68.182.0 gateway:
routes:
  - family: inet4
    dst: 10.0.0.0/8
    gateway: 10.68.182.0
    outLinkName: bond0
    table: main
    scope: global
    type: unicast
    protocol: static
    layer: platform
Hostnames
Even though the section supports multiple hostnames, only a single one should be used:
hostnames:
  - hostname: host
    domainname: some.org
    layer: platform
The domainname: is optional.
If the hostname is not set, Talos will use a default generated hostname.
Resolvers
The resolvers: section is used to configure DNS resolvers; only a single entry should be used:
resolvers:
  - dnsServers:
      - 8.8.8.8
      - 1.1.1.1
    layer: platform
If the dnsServers: is not set, Talos will use default DNS servers.
Time Servers
The timeServers: section is used to configure NTP time servers; only a single entry should be used:
timeServers:
  - timeServers:
      - 169.254.169.254
    layer: platform
If the timeServers: is not set, Talos will use default NTP servers.
Supplying META Network Configuration
Once the network configuration YAML document is ready, it can be supplied to Talos in one of the following ways:
- for a running Talos machine, using Talos API (requires already established network connectivity)
- for Talos disk images, it can be embedded into the image
- for ISO/PXE boot methods, it can be supplied via kernel command line parameters as an environment variable
The metal network configuration is stored in the Talos META partition under the key 0xa (decimal 10).
In this guide we will assume that the prepared network configuration is stored in the file network.yaml.
Note: as JSON is a subset of YAML, the network configuration can be also supplied as a JSON document.
Supplying Network Configuration to a Running Talos Machine
Use talosctl to write a network configuration to a running Talos machine:
talosctl meta write 0xa "$(cat network.yaml)"
Supplying Network Configuration to a Talos Disk Image
Following the boot assets guide, create a disk image passing the network configuration as a --meta
flag:
docker run --rm -t -v $PWD/_out:/out -v /dev:/dev --privileged ghcr.io/siderolabs/imager:v1.8.0 metal --meta "0xa=$(cat network.yaml)"
Supplying Network Configuration to a Talos ISO/PXE Boot
Because no META partition exists before Talos Linux is installed, META values can be set via the environment variable INSTALLER_META_BASE64 passed to the initial boot of Talos.
The supplied value is used immediately, and it is also written to the META partition once Talos is installed.
When using imager
to create the ISO, the INSTALLER_META_BASE64
environment variable will be automatically generated from the --meta
flag:
$ docker run --rm -t -v $PWD/_out:/out ghcr.io/siderolabs/imager:v1.8.0 iso --meta "0xa=$(cat network.yaml)"
...
kernel command line: ... talos.environment=INSTALLER_META_BASE64=MHhhPWZvbw==
When PXE booting, the value of INSTALLER_META_BASE64
should be set manually:
echo -n "0xa=$(cat network.yaml)" | gzip -9 | base64
The resulting base64 string should be passed as an environment variable INSTALLER_META_BASE64
to the initial boot of Talos: talos.environment=INSTALLER_META_BASE64=<base64-encoded value>
.
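The manual encoding step above can be wrapped in a small helper so the value is easy to generate and to check before it goes onto a kernel command line. The sketch below is only an illustration: the function names meta_kernel_arg and decode_meta are hypothetical and not part of Talos, and it assumes gzip and base64 are available on the machine preparing the boot configuration. It removes the line wrapping that some base64 implementations add, since the value has to fit on a single line.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Talos) for preparing INSTALLER_META_BASE64.

# meta_kernel_arg KEY FILE
# Prints the kernel argument for META key KEY with the contents of FILE,
# using the same "key=value | gzip -9 | base64" pipeline as the command above.
meta_kernel_arg() {
  encoded=$(printf '%s=%s' "$1" "$(cat "$2")" | gzip -9 | base64 | tr -d '\n')
  printf 'talos.environment=INSTALLER_META_BASE64=%s\n' "$encoded"
}

# decode_meta VALUE
# Reverses the encoding so the payload can be inspected before booting.
decode_meta() {
  printf '%s' "$1" | base64 -d | gzip -dc
}
```

For example, meta_kernel_arg 0xa network.yaml prints the full kernel argument, and passing the part after INSTALLER_META_BASE64= to decode_meta should print 0xa= followed by the contents of network.yaml.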
Getting Current META Network Configuration
Talos exports META
keys as resources:
# talosctl get meta 0x0a -o yaml
...
spec:
value: '{"addresses": ...}'
13 - Migrating from Kubeadm
It is possible to migrate a cluster that was created using kubeadm to Talos.
High-level steps are the following:
- Collect CA certificates and a bootstrap token from a control plane node.
- Create a Talos machine config, replacing the CA certificates with the ones you collected.
- Update control plane endpoint in the machine config to point to the existing control plane (i.e. your load balancer address).
- Boot a new Talos machine and apply the machine config.
- Verify that the new control plane node is ready.
- Remove one of the old control plane nodes.
- Repeat the same steps for all control plane nodes.
- Verify that all control plane nodes are ready.
- Repeat the same steps for all worker nodes, using the machine config generated for the workers.
Remarks on kube-apiserver load balancer
While migrating to Talos, you need to make sure that your kube-apiserver load balancer is in place and keeps pointing to the correct set of control plane nodes.
This process depends on your load balancer setup.
If you are using an LB that is external to the control plane nodes (e.g. cloud provider LB, F5 BIG-IP, etc.), you need to make sure that you update the backend IPs of the load balancer to point to the control plane nodes as you add Talos nodes and remove kubeadm-based ones.
If your load balancing is done on the control plane nodes (e.g. keepalived + haproxy on the control plane nodes), you can do the following:
- Add Talos nodes and remove kubeadm-based ones, updating the haproxy backends to point to the newly added nodes, until only the last kubeadm-based control plane node remains.
- Turn off keepalived to drop the virtual IP used by the kubeadm-based nodes (introduces kube-apiserver downtime).
- Set up a virtual-IP based new load balancer on the new set of Talos control plane nodes. Use the previous LB IP as the LB virtual IP.
- Verify apiserver connectivity over the Talos-managed virtual IP.
- Migrate the last control-plane node.
Prerequisites
- Admin access to the kubeadm-based cluster
- Access to the /etc/kubernetes/pki directory (e.g. SSH & root permissions) on the control plane nodes of the kubeadm-based cluster
- Access to kube-apiserver load-balancer configuration
Step-by-step guide
1. Download the /etc/kubernetes/pki directory from a control plane node of the kubeadm-based cluster.
2. Create a new join token for the new control plane nodes:
# inside a control plane node
kubeadm token create --ttl 0
3. Create Talos secrets from the PKI directory you downloaded in step 1 and the token you generated in step 2:
talosctl gen secrets --kubernetes-bootstrap-token <TOKEN> --from-kubernetes-pki <PKI_DIR>
4. Create a new Talos config from the secrets:
talosctl gen config --with-secrets secrets.yaml <CLUSTER_NAME> https://<EXISTING_CLUSTER_LB_IP>
5. Collect the information about the kubeadm-based cluster from the kubeadm configmap:
kubectl get configmap -n kube-system kubeadm-config -oyaml
Take note of the following information in the ClusterConfiguration:
- .controlPlaneEndpoint
- .networking.dnsDomain
- .networking.podSubnet
- .networking.serviceSubnet
6. Replace the following information in the generated controlplane.yaml:
- .cluster.network.cni.name with none
- .cluster.network.podSubnets[0] with the value of the networking.podSubnet from the previous step
- .cluster.network.serviceSubnets[0] with the value of the networking.serviceSubnet from the previous step
- .cluster.network.dnsDomain with the value of the networking.dnsDomain from the previous step
7. Go through the rest of controlplane.yaml and worker.yaml to customize them according to your needs, especially:
- .cluster.secretboxEncryptionSecret should be either removed if you don’t currently use EncryptionConfig on your kube-apiserver or set to the correct value
8. Make sure that, on your current kubeadm cluster, the first --service-account-issuer= parameter in /etc/kubernetes/manifests/kube-apiserver.yaml is equal to the value of .cluster.controlPlane.endpoint in controlplane.yaml. If it’s not, add a new --service-account-issuer= parameter with the correct value before your current one in /etc/kubernetes/manifests/kube-apiserver.yaml on all of your control plane nodes, and restart the kube-apiserver containers.
9. Bring up a Talos node to be the initial Talos control plane node.
10. Apply the generated controlplane.yaml to the Talos control plane node:
talosctl --nodes <TALOS_NODE_IP> apply-config --insecure --file controlplane.yaml
11. Wait until the new control plane node joins the cluster and is ready.
kubectl get node -owide --watch
12. Update your load balancer to point to the new control plane node.
13. Drain the old control plane node you are replacing:
kubectl drain <OLD_NODE> --delete-emptydir-data --force --ignore-daemonsets --timeout=10m
14. Remove the old control plane node from the cluster:
kubectl delete node <OLD_NODE>
15. Destroy the old node:
# inside the node
sudo kubeadm reset --force
16. Repeat the same steps, starting from step 7, for all control plane nodes.
17. Repeat the same steps, starting from step 7, for all worker nodes while applying the worker.yaml instead and skipping the LB step:
talosctl --nodes <TALOS_NODE_IP> apply-config --insecure --file worker.yaml
18. Your kubeadm kube-proxy configuration may not be compatible with the one generated by Talos, which will make the Talos Kubernetes upgrades impossible (labels may not be the same, and selector.matchLabels is an immutable field). To be sure, export your current kube-proxy daemonset manifest and check the labels; they have to be:
tier: node
k8s-app: kube-proxy
If they are not, modify all the labels fields, save the file, delete your current kube-proxy daemonset, and apply the one you modified.
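The step that collects values from the kubeadm-config ConfigMap can be made less error-prone by printing only the fields that end up in controlplane.yaml. The following is a minimal sketch, not an official tool: the function name kubeadm_cluster_fields is hypothetical, the sample values are illustrative only, and it assumes each field sits on its own line with an unquoted value.

```shell
#!/bin/sh
# Hypothetical helper: print the ClusterConfiguration fields needed for controlplane.yaml.
# Reads the ClusterConfiguration YAML on stdin and prints "field value" pairs.
kubeadm_cluster_fields() {
  awk '$1 ~ /^(controlPlaneEndpoint|dnsDomain|podSubnet|serviceSubnet):$/ {
         sub(/:$/, "", $1)   # drop the trailing colon from the key
         print $1, $2
       }'
}

# Example run against a sample ClusterConfiguration (values are made up).
kubeadm_cluster_fields <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: 10.0.0.10:6443
networking:
  dnsDomain: cluster.local
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12
EOF
```

Against a live cluster, the input could come from kubectl get configmap -n kube-system kubeadm-config -o jsonpath='{.data.ClusterConfiguration}' piped into kubeadm_cluster_fields.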
14 - Overlays
Overlays provide a way to customize the Talos Linux boot image. Overlays hook into the Talos install steps and can be used to provide additional boot assets (in the case of single board computers), extra kernel arguments, or custom configuration that is not part of the default Talos installation and is specific to a particular overlay.
Overlays vs. Extensions
Overlays are similar to extensions, but they are used to customize the installation process, while extensions are used to customize the root filesystem.
Official Overlays
The list of official overlays can be found in the Overlays GitHub repository.
Using Overlays
Overlays can be used to generate a modified metal image or installer image with the overlay applied.
The process of generating boot assets with overlays is described in the boot assets guide.
Example: Booting a Raspberry Pi 4 with an Overlay
Follow the board-specific guide for Raspberry Pi to download or generate the metal disk image and write it to an SD card.
Boot the machine with the boot media and apply the machine configuration with the installer image that has the overlay applied.
# Talos machine configuration patch
machine:
  install:
    image: factory.talos.dev/installer/fc1cceeb5711cd263877b6b808fbf4942a8deda65e8804c114a0b5bae252dc50:v1.8.0
Note: The schematic id shown in the above patch is for a vanilla
rpi_generic
overlay. Replace it with the schematic id of the overlay you want to apply.
Authoring Overlays
An overlay is a container image with a specific folder structure.
Overlays can be built and managed using any tool that produces container images, e.g. docker build
.
Sidero Labs maintains a repository of overlays.
Developing An Overlay
Let’s assume that you would like to contribute an overlay for a specific board, e.g. by contributing to the sbc-rockchip
repository.
Clone the repository and inspect the existing overlays to understand the structure.
Usually an overlay consists of a few key components:
- firmware: contains the firmware files required for the board
- bootloader: contains the bootloader, e.g. u-boot for the board
- dtb: contains the device tree blobs for the board
- installer: contains the installer that will be used to install this overlay on the node
- profile: contains information
- For the new overlay, create any needed folders and pkg.yaml files.
- If your board introduces a new chipset that is not supported yet, make sure to add the firmware build for it.
- Add the necessary u-boot and dtb build steps to the pkg.yaml files.
- Proceed to add an installer, which is a small go binary that will be used to install the overlay on the node. Here you need to add the go src/ as well as the pkg.yaml file.
- Lastly, add the profile information in the profiles folder.
You are now ready to attempt building the overlay. It’s recommended to push the build to a container registry to test the overlay with the Talos installer.
The default settings are:
- REGISTRY is set to ghcr.io
- USERNAME is set to siderolabs (or the value of the environment variable USERNAME if it is set)
make sbc-rockchip PUSH=true
If using a custom registry, the REGISTRY
and USERNAME
variables can be set:
make sbc-rockchip PUSH=true REGISTRY=<registry> USERNAME=<username>
After building the overlay, take note of the pushed image tag, e.g. 664638a
, because you will need it for the next step.
You can now build a flashable image using the command below.
export TALOS_VERSION=v1.7.6
export USERNAME=octocat
export BOARD=nanopi-r5s
export TAG=664638a
docker run --rm -t -v ./_out:/out -v /dev:/dev --privileged ghcr.io/siderolabs/imager:${TALOS_VERSION} \
metal --arch arm64 \
--base-installer-image="ghcr.io/siderolabs/installer:${TALOS_VERSION}" \
--overlay-name="${BOARD}" \
--overlay-image="ghcr.io/${USERNAME}/sbc-rockchip:${TAG}"
--overlay-option
--overlay-option
can be used to pass additional options to the overlay installer if they are implemented by the overlay.
An example can be seen in the sbc-raspberrypi overlay repository.
Multiple options can be passed by repeating the flag, or they can be read from a YAML document by passing --overlay-option=@<path to file>.
Note: In some cases you need to build a custom imager. In this case, refer to this guide on how to build custom images using an imager.
Troubleshooting
IMPORTANT: If this does not succeed, have a look at the documentation of the external dependencies you are pulling in and make sure that the pkg.yaml files are correctly configured. In some cases it may be required to update the dependencies to an appropriate version via the Pkgfile.
15 - Proprietary Kernel Modules
Patching and building the kernel image
1. Clone the pkgs repository from GitHub and check out the revision corresponding to your version of Talos Linux:
git clone https://github.com/talos-systems/pkgs pkgs && cd pkgs
git checkout v0.8.0
2. Clone the Linux kernel and check out the revision that pkgs uses (this can be found in kernel/kernel-prepare/pkg.yaml and it will be something like the following: https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-x.xx.x.tar.xz):
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git && cd linux
git checkout v5.15
3. Your module will need to be converted to be in-tree. The steps for this are different depending on the complexity of the module to port, but generally it would involve moving the module source code into the drivers tree and creating a new Makefile and Kconfig.
4. Stage your changes in Git with git add -A.
5. Run git diff --cached --no-prefix > foobar.patch to generate a patch from your changes.
6. Copy this patch to kernel/kernel/patches in the pkgs repo.
7. Add a patch line in the prepare segment of kernel/kernel/pkg.yaml:
patch -p0 < /pkg/patches/foobar.patch
8. Build the kernel image. Make sure you are logged in to ghcr.io before running this command, and you can change or omit PLATFORM depending on what you want to target.
make kernel PLATFORM=linux/amd64 USERNAME=your-username PUSH=true
9. Make a note of the image name the make command outputs.
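The patch is generated with --no-prefix and applied with -p0, and the two options belong together: without the default a/ and b/ prefixes, the paths in the patch are already relative to the tree root, so patch must not strip any leading path component. The sketch below illustrates this in a throwaway repository. It assumes git and patch are installed, and the directory and file contents (drivers/foobar, the Makefile line) are placeholders rather than a real module.

```shell
#!/bin/sh
# Sketch: a patch made with --no-prefix applies cleanly with -p0.
# All paths and file contents below are placeholders.
set -e
work=$(mktemp -d)

# A throwaway repository standing in for the kernel tree.
git init -q "$work/linux"
mkdir -p "$work/linux/drivers/foobar"
echo 'obj-m += foobar.o' > "$work/linux/drivers/foobar/Makefile"

# Stage the change and write the patch without the default a/ and b/ prefixes.
git -C "$work/linux" add -A
git -C "$work/linux" diff --cached --no-prefix > "$work/foobar.patch"

# The target path in the patch header has no prefix to strip.
grep '^+++ ' "$work/foobar.patch"

# Apply to a clean directory with -p0, as in the pkg.yaml patch line.
mkdir "$work/target"
(cd "$work/target" && patch -p0 < "$work/foobar.patch")
cat "$work/target/drivers/foobar/Makefile"
```

If the patch had been generated without --no-prefix, the matching option would be -p1 instead.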
Building the installer image
1. Copy the following into a new Dockerfile:
FROM scratch AS customization
COPY --from=ghcr.io/your-username/kernel:<kernel version> /lib/modules /lib/modules

FROM ghcr.io/siderolabs/installer:<talos version>
COPY --from=ghcr.io/your-username/kernel:<kernel version> /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
2. Run the following to build and push the installer:
INSTALLER_VERSION=<talos version>
IMAGE_NAME="ghcr.io/your-username/talos-installer:$INSTALLER_VERSION"
DOCKER_BUILDKIT=0 docker build --build-arg RM="/lib/modules" -t "$IMAGE_NAME" . && docker push "$IMAGE_NAME"
Deploying to your cluster
talosctl upgrade --image ghcr.io/your-username/talos-installer:<talos version> --preserve=true
16 - Static Pods
Static Pods
Static pods are run directly by the kubelet
bypassing the Kubernetes API server checks and validations.
Most of the time DaemonSet
is a better alternative to static pods, but some workloads need to run
before the Kubernetes API server is available or might need to bypass security restrictions imposed by the API server.
See Kubernetes documentation for more information on static pods.
Configuration
Static pod definitions are specified in the Talos machine configuration:
machine:
  pods:
    - apiVersion: v1
      kind: Pod
      metadata:
        name: nginx
      spec:
        containers:
          - name: nginx
            image: nginx
Talos renders static pod definitions to the kubelet using a local HTTP server; the kubelet picks up the definition and launches the pod.
Talos accepts changes to the static pod configuration without a reboot.
To see a full list of static pods, use talosctl get staticpods
, and to see the status of the static pods (as reported by the kubelet
), use talosctl get staticpodstatus
.
Usage
The kubelet mirrors the pod definition to the API server state, so static pods can be inspected with kubectl get pods, logs can be retrieved with kubectl logs, etc.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-talos-default-controlplane-2 1/1 Running 0 17s
If the API server is not available, status of the static pod can also be inspected with talosctl containers --kubernetes
:
$ talosctl containers --kubernetes
NODE NAMESPACE ID IMAGE PID STATUS
172.20.0.3 k8s.io default/nginx-talos-default-controlplane-2 registry.k8s.io/pause:3.6 4886 SANDBOX_READY
172.20.0.3 k8s.io └─ default/nginx-talos-default-controlplane-2:nginx:4183a7d7a771 docker.io/library/nginx:latest
...
Logs of static pods can be retrieved with talosctl logs --kubernetes
:
$ talosctl logs --kubernetes default/nginx-talos-default-controlplane-2:nginx:4183a7d7a771
172.20.0.3: 2022-02-10T15:26:01.289208227Z stderr F 2022/02/10 15:26:01 [notice] 1#1: using the "epoll" event method
172.20.0.3: 2022-02-10T15:26:01.2892466Z stderr F 2022/02/10 15:26:01 [notice] 1#1: nginx/1.21.6
172.20.0.3: 2022-02-10T15:26:01.28925723Z stderr F 2022/02/10 15:26:01 [notice] 1#1: built by gcc 10.2.1 20210110 (Debian 10.2.1-6)
Troubleshooting
Talos doesn’t perform any validation on the static pod definitions.
If the pod isn’t running, use kubelet
logs (talosctl logs kubelet
) to find the problem:
$ talosctl logs kubelet
172.20.0.2: {"ts":1644505520281.427,"caller":"config/file.go:187","msg":"Could not process manifest file","path":"/etc/kubernetes/manifests/talos-default-nginx-gvisor.yaml","err":"invalid pod: [spec.containers: Required value]"}
Resource Definitions
Static pod definitions are available as StaticPod
resources combined with Talos-generated control plane static pods:
$ talosctl get staticpods
NODE NAMESPACE TYPE ID VERSION
172.20.0.3 k8s StaticPod default-nginx 1
172.20.0.3 k8s StaticPod kube-apiserver 1
172.20.0.3 k8s StaticPod kube-controller-manager 1
172.20.0.3 k8s StaticPod kube-scheduler 1
Talos assigns ID <namespace>-<name>
to the static pods specified in the machine configuration.
On control plane nodes, the status of the running static pods is available in the StaticPodStatus resource:
$ talosctl get staticpodstatus
NODE NAMESPACE TYPE ID VERSION READY
172.20.0.3 k8s StaticPodStatus default/nginx-talos-default-controlplane-2 2 True
172.20.0.3 k8s StaticPodStatus kube-system/kube-apiserver-talos-default-controlplane-2 2 True
172.20.0.3 k8s StaticPodStatus kube-system/kube-controller-manager-talos-default-controlplane-2 3 True
172.20.0.3 k8s StaticPodStatus kube-system/kube-scheduler-talos-default-controlplane-2 3 True
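When many static pods are running, it can help to filter the status table down to the pods that are not ready. The helper below is a rough sketch rather than a Talos feature: the name not_ready_static_pods is hypothetical, and it assumes the column layout shown above (NODE NAMESPACE TYPE ID VERSION READY) and that any READY value other than True means the pod is not ready.

```shell
#!/bin/sh
# Hypothetical helper: list static pods whose READY column is not "True".
# Expects the table printed by `talosctl get staticpodstatus` on stdin,
# with the columns NODE NAMESPACE TYPE ID VERSION READY.
not_ready_static_pods() {
  # Skip the header row, then print the node and pod ID of every other row
  # whose sixth column differs from "True".
  awk 'NR > 1 && $6 != "True" { print $1, $4 }'
}
```

A possible invocation is talosctl get staticpodstatus | not_ready_static_pods, which prints nothing when every static pod reports True.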
17 - Talos API access from Kubernetes
In this guide, we will enable the Talos feature to access the Talos API from within Kubernetes.
Enabling the Feature
Edit the machine configuration to enable the feature, specifying the Kubernetes namespaces from which Talos API can be accessed and the allowed Talos API roles.
talosctl -n 172.20.0.2 edit machineconfig
Configure the kubernetesTalosAPIAccess
like the following:
spec:
  machine:
    features:
      kubernetesTalosAPIAccess:
        enabled: true
        allowedRoles:
          - os:reader
        allowedKubernetesNamespaces:
          - default
Injecting Talos ServiceAccount into manifests
Create the following manifest file deployment.yaml
:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: talos-api-access
spec:
  selector:
    matchLabels:
      app: talos-api-access
  template:
    metadata:
      labels:
        app: talos-api-access
    spec:
      containers:
        - name: talos-api-access
          image: alpine:3
          command:
            - sh
            - -c
            - |
              wget -O /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.8.0/talosctl-linux-amd64
              chmod +x /usr/local/bin/talosctl
              while true; talosctl -n 172.20.0.2 version; do sleep 1; done
Note: make sure that you replace the IP 172.20.0.2
with a valid Talos node IP.
Use talosctl inject serviceaccount
command to inject the Talos ServiceAccount into the manifest.
talosctl inject serviceaccount -f deployment.yaml > deployment-injected.yaml
Inspect the generated manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  name: talos-api-access
spec:
  selector:
    matchLabels:
      app: talos-api-access
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: talos-api-access
    spec:
      containers:
        - command:
            - sh
            - -c
            - |
              wget -O /usr/local/bin/talosctl https://github.com/siderolabs/talos/releases/download/v1.8.0/talosctl-linux-amd64
              chmod +x /usr/local/bin/talosctl
              while true; talosctl -n 172.20.0.2 version; do sleep 1; done
          image: alpine:3
          name: talos-api-access
          resources: {}
          volumeMounts:
            - mountPath: /var/run/secrets/talos.dev
              name: talos-secrets
      tolerations:
        - operator: Exists
      volumes:
        - name: talos-secrets
          secret:
            secretName: talos-api-access-talos-secrets
status: {}
---
apiVersion: talos.dev/v1alpha1
kind: ServiceAccount
metadata:
  name: talos-api-access-talos-secrets
spec:
  roles:
    - os:reader
---
As you can see, your deployment manifest is now injected with the Talos ServiceAccount.
Testing API Access
Apply the new manifest into the default namespace:
kubectl apply -n default -f deployment-injected.yaml
Follow the logs of the pods belonging to the deployment:
kubectl logs -n default -f -l app=talos-api-access
You’ll see a repeating output similar to the following:
Client:
Tag: <talos version>
SHA: ....
Built:
Go version: go1.18.4
OS/Arch: linux/amd64
Server:
NODE: 172.20.0.2
Tag: <talos version>
SHA: ...
Built:
Go version: go1.18.4
OS/Arch: linux/amd64
Enabled: RBAC
This means that the pod can talk to the Talos API of node 172.20.0.2 successfully.
18 - Verifying Images
Sidero Labs signs the container images generated for the Talos release with cosign:
- ghcr.io/siderolabs/installer (Talos installer)
- ghcr.io/siderolabs/talos (Talos image for container runtime)
- ghcr.io/siderolabs/talosctl (talosctl client packaged as a container image)
- ghcr.io/siderolabs/imager (Talos install image generator)
- all system extension images
Verifying Container Image Signatures
The cosign
tool can be used to verify the signatures of the Talos container images:
$ cosign verify --certificate-identity-regexp '@siderolabs\.com$' --certificate-oidc-issuer https://accounts.google.com ghcr.io/siderolabs/installer:v1.4.0
Verification for ghcr.io/siderolabs/installer:v1.4.0 --
The following checks were performed on each of these signatures:
- The cosign claims were validated
- Existence of the claims in the transparency log was verified offline
- The code-signing certificate was verified using trusted certificate authority certificates
[{"critical":{"identity":{"docker-reference":"ghcr.io/siderolabs/installer"},"image":{"docker-manifest-digest":"sha256:f41795cc88f40eb1bc6b3c638c4a3123f6ef3c90627bfc35c04ebab82581e3ee"},"type":"cosign container image signature"},"optional":{"1.3.6.1.4.1.57264.1.1":"https://accounts.google.com","Bundle":{"SignedEntryTimestamp":"MEQCIERkQpgEnPWnfjUHIWO9QxC9Ute3/xJOc7TO5GUnu59xAiBKcFvrDWHoUYChT0/+gaazTrI+r0/GWSbi+Q+sEQ5AKA==","Payload":{"body":"eyJhcGlWZXJzaW9uIjoiMC4wLjEiLCJraW5kIjoiaGFzaGVkcmVrb3JkIiwic3BlYyI6eyJkYXRhIjp7Imhhc2giOnsiYWxnb3JpdGhtIjoic2hhMjU2IiwidmFsdWUiOiJkYjhjYWUyMDZmODE5MDlmZmI4NjE4ZjRkNjIzM2ZlYmM3NzY5MzliOGUxZmZkMTM1ODA4ZmZjNDgwNjYwNGExIn19LCJzaWduYXR1cmUiOnsiY29udGVudCI6Ik1FVUNJUURQWXhiVG5vSDhJTzBEakRGRE9rNU1HUjRjMXpWMys3YWFjczNHZ2J0TG1RSWdHczN4dVByWUgwQTAvM1BSZmZydDRYNS9nOUtzQVdwdG9JbE9wSDF0NllrPSIsInB1YmxpY0tleSI6eyJjb250ZW50IjoiTFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVTXhha05EUVd4NVowRjNTVUpCWjBsVlNIbEhaRTFQVEhkV09WbFFSbkJYUVRKb01qSjRVM1ZIZVZGM2QwTm5XVWxMYjFwSmVtb3dSVUYzVFhjS1RucEZWazFDVFVkQk1WVkZRMmhOVFdNeWJHNWpNMUoyWTIxVmRWcEhWakpOVWpSM1NFRlpSRlpSVVVSRmVGWjZZVmRrZW1SSE9YbGFVekZ3WW01U2JBcGpiVEZzV2tkc2FHUkhWWGRJYUdOT1RXcE5kMDVFUlRSTlZHZDZUbXBWTlZkb1kwNU5hazEzVGtSRk5FMVVaekJPYWxVMVYycEJRVTFHYTNkRmQxbElDa3R2V2tsNmFqQkRRVkZaU1V0dldrbDZhakJFUVZGalJGRm5RVVZaUVdKaVkwbDZUVzR3ZERBdlVEZHVUa0pNU0VscU1rbHlORTFQZGpoVVRrVjZUemNLUkVadVRXSldVbGc0TVdWdmExQnVZblJHTVZGMmRWQndTVm95VkV3NFFUUkdSMWw0YldFeGJFTk1kMkk0VEZOVWMzRlBRMEZZYzNkblowWXpUVUUwUndwQk1WVmtSSGRGUWk5M1VVVkJkMGxJWjBSQlZFSm5UbFpJVTFWRlJFUkJTMEpuWjNKQ1owVkdRbEZqUkVGNlFXUkNaMDVXU0ZFMFJVWm5VVlZqYWsweUNrbGpVa1lyTkhOVmRuRk5ia3hsU0ZGMVJIRkdRakZqZDBoM1dVUldVakJxUWtKbmQwWnZRVlV6T1ZCd2VqRlphMFZhWWpWeFRtcHdTMFpYYVhocE5Ga0tXa1E0ZDB0M1dVUldVakJTUVZGSUwwSkRSWGRJTkVWa1dWYzFhMk50VmpWTWJrNTBZVmhLZFdJeldrRmpNbXhyV2xoS2RtSkhSbWxqZVRWcVlqSXdkd3BMVVZsTFMzZFpRa0pCUjBSMmVrRkNRVkZSWW1GSVVqQmpTRTAyVEhrNWFGa3lUblprVnpVd1kzazFibUl5T1c1aVIxVjFXVEk1ZEUxRGMwZERhWE5IQ2tGUlVVSm5OemgzUVZGblJVaFJkMkpoU0ZJd1kwaE5Oa3g1T1
doWk1rNTJaRmMxTUdONU5XNWlNamx1WWtkVmRWa3lPWFJOU1VkTFFtZHZja0puUlVVS1FXUmFOVUZuVVVOQ1NIZEZaV2RDTkVGSVdVRXpWREIzWVhOaVNFVlVTbXBIVWpSamJWZGpNMEZ4U2t0WWNtcGxVRXN6TDJnMGNIbG5Remh3TjI4MFFRcEJRVWRJYkdGbVp6Um5RVUZDUVUxQlVucENSa0ZwUVdKSE5tcDZiVUkyUkZCV1dUVXlWR1JhUmtzeGVUSkhZVk5wVW14c1IydHlSRlpRVXpsSmJGTktDblJSU1doQlR6WlZkbnBFYVVOYVFXOXZSU3RLZVdwaFpFdG5hV2xLT1RGS00yb3ZZek5CUTA5clJIcFhOamxaVUUxQmIwZERRM0ZIVTAwME9VSkJUVVFLUVRKblFVMUhWVU5OUVZCSlRUVjJVbVpIY0VGVWNqQTJVR1JDTURjeFpFOXlLMHhFSzFWQ04zbExUVWRMWW10a1UxTnJaMUp5U3l0bGNuZHdVREp6ZGdvd1NGRkdiM2h0WlRkM1NYaEJUM2htWkcxTWRIQnpjazFJZGs5cWFFSmFTMVoxVG14WmRXTkJaMVF4V1VWM1ZuZHNjR2QzYTFWUFdrWjRUemRrUnpONkNtVnZOWFJ3YVdoV1kyTndWMlozUFQwS0xTMHRMUzFGVGtRZ1EwVlNWRWxHU1VOQlZFVXRMUzB0TFFvPSJ9fX19","integratedTime":1681843022,"logIndex":18304044,"logID":"c0d23d6ad406973f9559f3ba2d1ca01f84147d8ffc5b8445c224f98b9591801d"}},"Issuer":"https://accounts.google.com","Subject":"andrey.smirnov@siderolabs.com"}}]
The image should be signed using the cosign certificate authority flow by a Sidero Labs employee with an email from the siderolabs.com domain.
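The cosign flags above already enforce the signer identity, but for logging or as an extra scripted check it can be useful to pull the signer out of the JSON that cosign prints. The helpers below are a rough sketch under stated assumptions: the function names are hypothetical, and the parsing relies only on the "Subject" field as it appears in the output above, using grep and sed instead of a JSON parser.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting `cosign verify` JSON output on stdin.

# Print every signer identity found in the "Subject" fields, one per line.
cosign_subjects() {
  grep -o '"Subject":"[^"]*"' | sed 's/^"Subject":"//; s/"$//'
}

# Succeed only if at least one subject is present and every subject
# ends with @siderolabs.com.
all_subjects_from_siderolabs() {
  subjects=$(cosign_subjects)
  [ -n "$subjects" ] || return 1
  ! printf '%s\n' "$subjects" | grep -qv '@siderolabs\.com$'
}
```

For instance, the output of the cosign verify command shown above could be piped into cosign_subjects to print the signer, or into all_subjects_from_siderolabs as a condition in a script.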
Reproducible Builds
Talos builds for kernel
, initramfs
, talosctl
, ISO image, and container images are reproducible.
You can therefore verify that a build is the same as the one provided on the GitHub releases page.
See building Talos images for more details.
19 - Watchdog Timers
Talos Linux supports hardware watchdog timer configuration. A hardware watchdog timer resets (reboots) the system if the software stack becomes unresponsive. Please consult your hardware/VM documentation for the availability of hardware watchdog timers.
Configuration
To discover the available watchdog devices, run:
$ talosctl ls /sys/class/watchdog/
NODE NAME
172.20.0.2 .
172.20.0.2 watchdog0
172.20.0.2 watchdog1
The implementation of the watchdog device can be queried with:
$ talosctl read /sys/class/watchdog/watchdog0/identity
i6300ESB timer
To enable the watchdog timer, patch the machine configuration with the following:
# watchdog.yaml
apiVersion: v1alpha1
kind: WatchdogTimerConfig
device: /dev/watchdog0
timeout: 5m
talosctl patch mc -p @watchdog.yaml
Talos Linux will set up the watchdog timer with a 5-minute timeout, and it will keep resetting the timer to prevent the system from rebooting. If the software becomes unresponsive, the watchdog timer will expire, and the system will be reset by the watchdog hardware.
Inspection
To inspect the watchdog timer configuration, run:
$ talosctl get watchdogtimerconfig
NODE NAMESPACE TYPE ID VERSION DEVICE TIMEOUT
172.20.0.2 runtime WatchdogTimerConfig timer 1 /dev/watchdog0 5m0s
To inspect the watchdog timer status, run:
$ talosctl get watchdogtimerstatus
NODE NAMESPACE TYPE ID VERSION DEVICE TIMEOUT
172.20.0.2 runtime WatchdogTimerStatus timer 1 /dev/watchdog0 5m0s
Current status of the watchdog timer can also be inspected via Linux sysfs:
$ talosctl read /sys/class/watchdog/watchdog0/state
active