Introduction
1 - What is Talos?
Talos is a container-optimized Linux distribution: a reimagining of Linux for distributed systems such as Kubernetes. It is designed to be as minimal as possible while remaining practical. For these reasons, Talos has a number of distinctive features:
- it is immutable
- it is atomic
- it is ephemeral
- it is minimal
- it is secure by default
- it is managed via a single declarative configuration file and gRPC API
Talos can be deployed on container, cloud, virtualized, and bare metal platforms.
Why Talos
In having less, Talos offers more. Security. Efficiency. Resiliency. Consistency.
All of these areas are improved simply by having less.
2 - Quickstart
Local Docker Cluster
The easiest way to try Talos is by using the CLI (talosctl) to create a cluster on a machine with docker installed.
Prerequisites
talosctl
Download talosctl (macOS or Linux):
brew install siderolabs/tap/talosctl
kubectl
Download kubectl via one of the methods outlined in the documentation.
Create the Cluster
Now run the following:
talosctl cluster create
Note
If you are using Docker Desktop on a macOS computer and encounter the error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? you may need to manually create the link for the Docker socket:
sudo ln -s "$HOME/.docker/run/docker.sock" /var/run/docker.sock
You can explore using Talos API commands:
talosctl dashboard --nodes 10.5.0.2
Verify that you can reach Kubernetes:
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-default-controlplane-1 Ready master 115s v1.33.0 10.5.0.2 <none> Talos (v1.10.0-alpha.0) <host kernel> containerd://1.5.5
talos-default-worker-1 Ready <none> 115s v1.33.0 10.5.0.3 <none> Talos (v1.10.0-alpha.0) <host kernel> containerd://1.5.5
Destroy the Cluster
When you are all done, remove the cluster:
talosctl cluster destroy
3 - Getting Started
This document will walk you through installing a simple Talos Cluster with a single control plane node and one or more worker nodes, explaining some of the concepts.
If this is your first use of Talos Linux, we recommend the Quickstart first, to quickly create a local virtual cluster in containers on your workstation.
For a production cluster, extra steps are needed - see Production Notes.
Regardless of where you run Talos, the steps to create a Kubernetes cluster are:
- boot machines off the Talos Linux image
- define the endpoint for the Kubernetes API and generate your machine configurations
- configure Talos Linux by applying machine configurations to the machines
- configure talosctl
- bootstrap Kubernetes
Prerequisites
talosctl
talosctl is a CLI tool which interfaces with the Talos API.
Talos Linux has no SSH access: talosctl is the tool you use to interact with the operating system on the machines.
You can download talosctl on macOS and Linux via:
brew install siderolabs/tap/talosctl
For manual installation and other platforms, please see the talosctl installation guide.
Note: If you boot systems off the ISO, Talos on the ISO image runs in RAM and acts as an installer. The version of talosctl that is used to create the machine configurations controls the version of Talos Linux that is installed on the machines - NOT the image that the machines are initially booted off. For example, booting a machine off the Talos 1.3.7 ISO, but creating the initial configuration with a talosctl binary of version 1.4.1, will result in a machine running Talos Linux version 1.4.1.
It is advisable to use the same version of talosctl as the version of the boot media used.
Network access
This guide assumes that the systems being installed have outgoing access to the internet, allowing them to pull installer and container images, query NTP, etc. If needed, see the documentation on registry proxies, local registries, and airgapped installation.
Acquire the Talos Linux image and boot machines
The most general way to install Talos Linux is to use the ISO image.
The latest ISO image can be found on the GitHub Releases page:
- X86: https://github.com/siderolabs/talos/releases/download/v1.10.0-alpha.0/metal-amd64.iso
- ARM64: https://github.com/siderolabs/talos/releases/download/v1.10.0-alpha.0/metal-arm64.iso
When booted from the ISO, Talos will run in RAM and will not install to disk until provided a configuration. Thus, it is safe to boot any machine from the ISO.
At this point, you should:
- boot one machine off the ISO to be the control plane node
- boot one or more machines off the same ISO to be the workers
Alternative Booting
For network booting and self-built media, see Production Notes. There are also platform-specific installation methods, such as pre-built AMIs for AWS - check the relevant Installation Guides.
Define the Kubernetes Endpoint
In order to configure Kubernetes, Talos needs to know what the endpoint of the Kubernetes API Server will be.
Because we are only creating a single control plane node in this guide, we can use the control plane node directly as the Kubernetes API endpoint.
Identify the IP address or DNS name of the control plane node that was booted above, and convert it to a fully-qualified HTTPS URL endpoint address for the Kubernetes API Server which (by default) runs on port 6443. The endpoint should be formatted like:
https://192.168.0.2:6443
https://kube.mycluster.mydomain.com:6443
NOTE: For a production cluster, you should have three control plane nodes, and have the endpoint distribute traffic across all three - see Production Notes.
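As a minimal sketch of the conversion described above, the following Python helper turns an IP address or DNS name into an endpoint URL. The function name kubernetes_endpoint is hypothetical, and the IPv6 bracket handling is an addition based on general URL syntax rather than anything stated in this guide.

```python
import ipaddress


def kubernetes_endpoint(host: str, port: int = 6443) -> str:
    """Build an HTTPS endpoint URL from an IP address or DNS name."""
    try:
        # IPv6 literals must be wrapped in brackets inside a URL.
        if ipaddress.ip_address(host).version == 6:
            host = f"[{host}]"
    except ValueError:
        # Not an IP literal, so treat it as a DNS name and use it unchanged.
        pass
    return f"https://{host}:{port}"


print(kubernetes_endpoint("192.168.0.2"))
print(kubernetes_endpoint("kube.mycluster.mydomain.com"))
```

The two calls print the same example endpoints shown above.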
Configure Talos Linux
When Talos boots without a configuration, such as when booting off the Talos ISO, it enters maintenance mode and waits for a configuration to be provided.
A configuration can be passed in on boot via kernel parameters or metadata servers. See Production Notes.
Unlike traditional Linux, Talos Linux is not configured by SSHing to the server and issuing commands.
Instead, the entire state of the machine is defined by a machine config file which is passed to the server.
This allows machines to be managed in a declarative way, and lends itself to GitOps and modern operations paradigms.
The state of a machine is completely defined by, and can be reproduced from, the machine configuration file.
To generate the machine configurations for a cluster, run this command on the workstation where you installed talosctl:
talosctl gen config <cluster-name> <cluster-endpoint>
cluster-name is an arbitrary name, used as a label in your local client configuration.
It should be unique in the configuration on your local workstation.
cluster-endpoint is the Kubernetes Endpoint you constructed from the control plane node’s IP address or DNS name above.
It should be a complete URL, with https:// and port.
For example:
$ talosctl gen config mycluster https://192.168.0.2:6443
generating PKI and tokens
created /Users/taloswork/controlplane.yaml
created /Users/taloswork/worker.yaml
created /Users/taloswork/talosconfig
When you run this command, three files are created in your current directory:
controlplane.yaml
worker.yaml
talosconfig
The .yaml files are Machine Configs: they describe everything from what disk Talos should be installed on, to network settings.
The controlplane.yaml file also describes how Talos should form a Kubernetes cluster.
The talosconfig file is your local client configuration file, used to connect to and authenticate access to the cluster.
Controlplane and Worker
The two types of Machine Configs correspond to the two roles of Talos nodes: control plane nodes (which run both the Talos and Kubernetes control planes) and worker nodes (which run the workloads).
The main difference between Controlplane Machine Config files and Worker Machine Config files is that the former contains information about how to form the Kubernetes cluster.
Modifying the Machine configs
The generated Machine Configs have defaults that work for most cases.
They use DHCP for interface configuration, and install to /dev/sda.
Sometimes, you will need to modify the generated files to work with your systems. A common case is needing to change the installation disk. If you try to apply the machine config to a node, and get an error like the one below, you need to specify a different installation disk:
$ talosctl apply-config --insecure -n 192.168.0.2 --file controlplane.yaml
error applying new configuration: rpc error: code = InvalidArgument desc = configuration validation failed: 1 error occurred:
* specified install disk does not exist: "/dev/sda"
You can verify which disks your nodes have by using the talosctl disks --insecure command.
Insecure mode is needed at this point as the PKI infrastructure has not yet been set up.
For example, the talosctl disks command below shows that the system has a vda drive, not an sda:
$ talosctl -n 192.168.0.2 disks --insecure
DEV MODEL SERIAL TYPE UUID WWID MODALIAS NAME SIZE BUS_PATH
/dev/vda - - HDD - - virtio:d00000002v00001AF4 - 69 GB /pci0000:00/0000:00:06.0/virtio2/
In this case, you would modify the controlplane.yaml and worker.yaml files and edit the line:
install:
  disk: /dev/sda # The disk used for installations.
to reflect vda instead of sda.
For information on customizing your machine configurations (such as to specify the version of Kubernetes), using machine configuration patches, or customizing configurations for individual machines (such as setting static IP addresses), see the Production Notes.
Accessing the Talos API
Administrative tasks are performed by calling the Talos API (usually with talosctl) on Talos Linux control plane nodes, which may forward the requests to other nodes.
Thus:
- ensure your control plane node is directly reachable on TCP port 50000 from the workstation where you run the talosctl client.
- until a node is a member of the cluster, it does not have the PKI infrastructure set up, and so will not accept API requests that are proxied through a control plane node. Thus you will need direct access to the worker nodes on port 50000 from the workstation where you run talosctl in order to apply the initial configuration. Once the cluster is established, you will no longer need port 50000 access to the workers. (You can avoid requiring such access by passing in the initial configuration by one of the other methods, such as by cloud userdata or via the talos.config= kernel argument on a metal platform.)
This may require changing firewall rules or cloud provider access-lists.
For production configurations, see Production Notes.
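Before applying configurations, it can help to confirm that port 50000 is actually reachable from your workstation. The sketch below is one way to do that from Python using only the standard library. The function name talos_api_reachable is hypothetical, and the check only tests that a TCP connection can be opened; it does not speak the Talos API or validate certificates.

```python
import socket


def talos_api_reachable(host: str, port: int = 50000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, talos_api_reachable("192.168.0.2") would test the example control plane address used in this guide. A False result usually points at a firewall rule or access list between the workstation and the node.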
Understand how talosctl treats endpoints and nodes
In short: endpoints are where talosctl sends commands to, but the command operates on the specified nodes.
The endpoint will forward the command to the nodes, if needed.
Endpoints
Endpoints are the IP addresses of control plane nodes, to which the talosctl client directly talks.
Endpoints automatically proxy requests destined to another node in the cluster. This means that you only need access to the control plane nodes in order to manage the rest of the cluster.
You can pass in --endpoints <Control Plane IP Address> or -e <Control Plane IP Address> to the current talosctl command.
In this tutorial setup, the endpoint will always be the single control plane node.
Nodes
Nodes are the target(s) you wish to perform the operation on.
When specifying nodes, the IPs and/or hostnames are as seen by the endpoint servers, not as seen from the client. This is because all connections are proxied through the endpoints.
You may provide -n or --nodes to any talosctl command to supply the node or (comma-separated) nodes on which you wish to perform the operation.
For example, to see the containers running on node 192.168.0.200, by routing the containers command through the control plane endpoint 192.168.0.2:
talosctl -e 192.168.0.2 -n 192.168.0.200 containers
To see the etcd logs on both nodes 192.168.0.10 and 192.168.0.11:
talosctl -e 192.168.0.2 -n 192.168.0.10,192.168.0.11 logs etcd
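If you script talosctl calls, it may help to keep endpoints and nodes as separate lists and join them only when building the command line. The following is a small sketch under that assumption. The helper talosctl_argv is hypothetical, and it uses only the -e, -n, and --talosconfig flags shown in this guide.

```python
import shlex


def talosctl_argv(endpoints, nodes, command, talosconfig=None):
    """Assemble a talosctl command line as a list of arguments."""
    argv = ["talosctl"]
    if talosconfig:
        argv.append(f"--talosconfig={talosconfig}")
    # Endpoints: control plane addresses that talosctl connects to directly.
    argv += ["-e", ",".join(endpoints)]
    # Nodes: targets of the operation, addressed as seen from the endpoint.
    argv += ["-n", ",".join(nodes)]
    return argv + list(command)


argv = talosctl_argv(["192.168.0.2"], ["192.168.0.10", "192.168.0.11"], ["logs", "etcd"])
print(shlex.join(argv))
```

The printed line matches the etcd logs example above. The resulting list can be passed to subprocess.run without going through a shell.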
For a more in-depth discussion of Endpoints and Nodes, please see talosctl.
Apply Configuration
To apply the Machine Configs, you need to know the machines’ IP addresses.
Talos prints the IP addresses of the machines on the console during the boot process:
[4.605369] [talos] task loadConfig (1/1): this machine is reachable at:
[4.607358] [talos] task loadConfig (1/1): 192.168.0.2
If you do not have console access, the IP address may also be discoverable from your DHCP server.
Once you have the IP address, you can then apply the correct configuration.
Apply the controlplane.yaml file to the control plane node, and the worker.yaml file to all the worker node(s).
talosctl apply-config --insecure \
--nodes 192.168.0.2 \
--file controlplane.yaml
The --insecure flag is necessary because the PKI infrastructure has not yet been made available to the node.
Note: the connection will be encrypted, but not authenticated.
When using the --insecure flag, you cannot specify an endpoint, and must directly access the node on port 50000.
Default talosconfig configuration file
You specify which configuration file to use with the --talosconfig parameter:
talosctl --talosconfig=./talosconfig \
--nodes 192.168.0.2 -e 192.168.0.2 version
Note that talosctl comes with tooling to help you integrate and merge this configuration into the default talosctl configuration file.
See Production Notes for more information.
While getting started, a common mistake is referencing a configuration context for a different cluster, resulting in authentication or connection failures. Thus it is recommended to explicitly pass in the configuration file while becoming familiar with Talos Linux.
Kubernetes Bootstrap
Bootstrapping your Kubernetes cluster with Talos is as simple as calling talosctl bootstrap on your control plane node:
talosctl bootstrap --nodes 192.168.0.2 --endpoints 192.168.0.2 \
--talosconfig=./talosconfig
The bootstrap operation should only be called ONCE on a SINGLE control plane node. (If you have multiple control plane nodes, it doesn’t matter which one you issue the bootstrap command against.)
At this point, Talos will form an etcd cluster, and start the Kubernetes control plane components.
After a few moments, you will be able to download your Kubernetes client configuration and get started:
talosctl kubeconfig --nodes 192.168.0.2 --endpoints 192.168.0.2 \
--talosconfig=./talosconfig
Running this command will add (merge) your new cluster into your local Kubernetes configuration.
If you would prefer the configuration to not be merged into your default Kubernetes configuration file, pass in a filename:
talosctl kubeconfig alternative-kubeconfig --nodes 192.168.0.2 --endpoints 192.168.0.2 \
--talosconfig=./talosconfig
You should now be able to connect to Kubernetes and see your nodes:
kubectl get nodes
And use talosctl to explore your cluster:
talosctl --nodes 192.168.0.2 --endpoints 192.168.0.2 health \
--talosconfig=./talosconfig
talosctl --nodes 192.168.0.2 --endpoints 192.168.0.2 dashboard \
--talosconfig=./talosconfig
For a list of all the commands and operations that talosctl provides, see the CLI reference.
4 - Production Clusters
This document explains recommendations for running Talos Linux in production.
Acquire the installation image
Alternative Booting
For network booting and self-built media, you can use the published kernel and initramfs images:
Note that to use alternate booting, there are a number of required kernel parameters. Please see the kernel docs for more information.
Control plane nodes
For a production, highly available Kubernetes cluster, it is recommended to use three control plane nodes. Using five nodes can provide greater fault tolerance, but imposes more replication overhead and can result in worse performance.
Boot all three control plane nodes at this point. They will boot Talos Linux, and come up in maintenance mode, awaiting a configuration.
Decide the Kubernetes Endpoint
The Kubernetes API Server endpoint, in order to be highly available, should be configured in a way that uses all available control plane nodes. There are three common ways to do this: using a load balancer, using Talos Linux’s built-in VIP functionality, or using multiple DNS records.
Dedicated Load-balancer
If you are using a cloud provider or have your own load-balancer (such as HAProxy, Nginx reverse proxy, or an F5 load-balancer), a dedicated load balancer is a natural choice. Create an appropriate frontend for the endpoint, listening on TCP port 6443, and point the backends at the addresses of each of the Talos control plane nodes. Your Kubernetes endpoint will be the IP address or DNS name of the load balancer front end, with the port appended (e.g. https://myK8s.mydomain.io:6443).
Note: an HTTP load balancer can’t be used, as Kubernetes API server does TLS termination and mutual TLS authentication.
Layer 2 VIP Shared IP
Talos has integrated support for serving Kubernetes from a shared/virtual IP address. This requires Layer 2 connectivity between control plane nodes.
Choose an unused IP address on the same subnet as the control plane nodes for the VIP. For instance, if your control plane node IPs are:
- 192.168.0.10
- 192.168.0.11
- 192.168.0.12
you could choose the IP 192.168.0.15 as your VIP IP address.
(Make sure that 192.168.0.15 is not used by any other machine and is excluded from DHCP ranges.)
Once chosen, form the full HTTPS URL from this IP:
https://192.168.0.15:6443
If you create a DNS record for this IP, note you will need to use the IP address itself, not the DNS name, to configure the shared IP (machine.network.interfaces[].vip.ip) in the Talos configuration.
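The basic constraints on the VIP (same subnet, not a node address) can be checked before you generate configurations. The sketch below uses Python's ipaddress module. The function name vip_problems is hypothetical, and the /24 prefix length in the usage note is an assumption, since the example above does not state a netmask. It cannot tell whether another machine or a DHCP range already uses the address, so that still needs to be verified separately.

```python
import ipaddress


def vip_problems(vip: str, subnet: str, node_ips: list[str]) -> list[str]:
    """Return a list of reasons why vip is a poor choice (empty if none found)."""
    problems = []
    network = ipaddress.ip_network(subnet)
    address = ipaddress.ip_address(vip)
    if address not in network:
        problems.append(f"{vip} is outside {subnet}")
    elif address in (network.network_address, network.broadcast_address):
        problems.append(f"{vip} is the network or broadcast address of {subnet}")
    # The VIP must not collide with an address a node already holds.
    if address in {ipaddress.ip_address(ip) for ip in node_ips}:
        problems.append(f"{vip} is already assigned to a control plane node")
    return problems
```

With the control plane addresses listed above and an assumed subnet of 192.168.0.0/24, vip_problems("192.168.0.15", "192.168.0.0/24", [...]) returns an empty list, while passing 192.168.0.11 as the VIP reports the collision.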
After the machine configurations are generated, you will want to edit the controlplane.yaml file to activate the VIP:
machine:
  network:
    interfaces:
      - interface: enp2s0
        dhcp: true
        vip:
          ip: 192.168.0.15
For more information about using a shared IP, see the related Guide.
DNS records
Add multiple A or AAAA records (one for each control plane node) to a DNS name.
For instance, you could add:
kube.cluster1.mydomain.com IN A 192.168.0.10
kube.cluster1.mydomain.com IN A 192.168.0.11
kube.cluster1.mydomain.com IN A 192.168.0.12
where the IP addresses are those of the control plane nodes.
Then, your endpoint would be:
https://kube.cluster1.mydomain.com:6443
Multihoming
If your machines are multihomed, i.e., they have more than one IPv4 and/or IPv6 address other than loopback, then additional configuration is required. Note that machines may also become multihomed via privileged workloads.
Multihoming and etcd
The etcd cluster needs to establish a mesh of connections among the members.
This is done using the so-called advertised address - each node learns the others’ addresses as they are advertised.
It is crucial that these IP addresses are stable, i.e., that each node always advertises the same IP address.
Moreover, it is beneficial to control them to establish the correct routes between the members and, e.g., avoid congested paths.
In Talos, these addresses are controlled using the cluster.etcd.advertisedSubnets configuration key.
Multihoming and kubelets
Stable IP addressing for kubelets (i.e., nodeIP) is not strictly necessary but highly recommended as it ensures that, e.g., kube-proxy and CNI routing take the desired routes.
Analogously to etcd, for kubelets this is controlled via machine.kubelet.nodeIP.validSubnets.
Example
Let’s assume that we have a cluster with two networks:
- public network
- private network 192.168.0.0/16
We want to use the private network for etcd and kubelet communication:
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 192.168.0.0/16
#...
cluster:
  etcd:
    advertisedSubnets: # listenSubnets defaults to advertisedSubnets if not set explicitly
      - 192.168.0.0/16
This way we ensure that the etcd cluster will use the private network for communication and the kubelets will use the private network for communication with the control plane.
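To see what a subnet filter of this kind does on a multihomed machine, consider the sketch below. It only illustrates the idea of keeping the addresses that fall inside the configured subnets; it is not Talos's actual selection logic. The function name and the public address 203.0.113.10 (from a documentation-only range) are assumptions.

```python
import ipaddress


def addresses_in_subnets(addresses, subnets):
    """Return the addresses that fall inside at least one of the given subnets."""
    networks = [ipaddress.ip_network(s) for s in subnets]
    return [
        a for a in addresses
        if any(ipaddress.ip_address(a) in n for n in networks)
    ]


# A node with one (assumed) public and one private address.
node_addresses = ["203.0.113.10", "192.168.0.10"]
print(addresses_in_subnets(node_addresses, ["192.168.0.0/16"]))
```

Only the private address survives the filter, which is the address you would expect etcd and the kubelet to use with the configuration above.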
Load balancing the Talos API
The talosctl tool provides built-in client-side load-balancing across control plane nodes, so usually you do not need to configure a load balancer for the Talos API.
However, if the control plane nodes are not directly reachable from the workstation where you run talosctl, then configure a load balancer to forward TCP port 50000 to the control plane nodes.
Note: Because the Talos Linux API uses gRPC and mutual TLS, it cannot be proxied by a HTTP/S proxy, but only by a TCP load balancer.
If you create a load balancer to forward the Talos API calls, the load balancer IP or hostname will be used as the endpoint for talosctl.
Add the load balancer IP or hostname to the .machine.certSANs field of the machine configuration file.
Do not use Talos Linux’s built-in VIP function for accessing the Talos API. In the event of an error in etcd, the VIP will not function, and you will not be able to access the Talos API to recover.
Configure Talos
In many installation methods, a configuration can be passed in on boot.
For example, Talos can be booted with the talos.config kernel argument set to an HTTP(S) URL from which it should receive its configuration.
Where a PXE server is available, this is much more efficient than manually configuring each node.
If you do use this method, note that Talos requires a number of other kernel commandline parameters.
See required kernel parameters.
Similarly, if creating Kubernetes clusters on EC2, the configuration file can be passed in as --user-data to the aws ec2 run-instances command.
See generally the Installation Guide for the platform being deployed.
Separating out secrets
When generating the configuration files for a Talos Linux cluster, it is recommended to start with generating a secrets bundle which should be saved in a secure location. This bundle can be used to generate machine or client configurations at any time:
talosctl gen secrets -o secrets.yaml
The secrets.yaml can also be extracted from the existing controlplane machine configuration with the talosctl gen secrets --from-controlplane-config controlplane.yaml -o secrets.yaml command.
Now, we can generate the machine configuration for each node:
talosctl gen config --with-secrets secrets.yaml <cluster-name> <cluster-endpoint>
Here, cluster-name is an arbitrary name for the cluster, used in your local client configuration as a label.
It should be unique in the configuration on your local workstation.
The cluster-endpoint is the Kubernetes Endpoint you selected above.
This is the Kubernetes API URL, and it should be a complete URL, with https:// and port.
(The default port is 6443, but you may have configured your load balancer to forward a different port.)
For example:
$ talosctl gen config --with-secrets secrets.yaml my-cluster https://192.168.64.15:6443
generating PKI and tokens
created controlplane.yaml
created worker.yaml
created talosconfig
Customizing Machine Configuration
The generated machine configuration provides sane defaults for most cases, but can be modified to fit specific needs.
Some machine configuration options are available as flags for the talosctl gen config command, for example setting a specific Kubernetes version:
talosctl gen config --with-secrets secrets.yaml --kubernetes-version 1.25.4 my-cluster https://192.168.64.15:6443
Other modifications are done with machine configuration patches.
Machine configuration patches can be applied with the talosctl gen config command:
talosctl gen config --with-secrets secrets.yaml --config-patch-control-plane @cni.patch my-cluster https://192.168.64.15:6443
Note: @cni.patch means that the patch is read from a file named cni.patch.
Machine Configs as Templates
Individual machines may need different settings: for instance, each may have a different static IP address.
When different files are needed for machines of the same type, there are two supported flows:
- Use the talosctl gen config command to generate a template, and then patch the template for each machine with talosctl machineconfig patch.
- Generate each machine configuration file separately with talosctl gen config while applying patches.
For example, given a machine configuration patch which sets the static machine hostname:
# worker1.patch
machine:
  network:
    hostname: worker1
Either of the following commands will generate a worker machine configuration file with the hostname set to worker1:
$ talosctl gen config --with-secrets secrets.yaml my-cluster https://192.168.64.15:6443
created /Users/taloswork/controlplane.yaml
created /Users/taloswork/worker.yaml
created /Users/taloswork/talosconfig
$ talosctl machineconfig patch worker.yaml --patch @worker1.patch --output worker1.yaml
talosctl gen config --with-secrets secrets.yaml --config-patch-worker @worker1.patch --output-types worker -o worker1.yaml my-cluster https://192.168.64.15:6443
Apply Configuration while validating the node identity
If you have console access you can extract the server certificate fingerprint and use it for an additional layer of validation:
talosctl apply-config --insecure \
--nodes 192.168.0.2 \
--cert-fingerprint xA9a1t2dMxB0NJ0qH1pDzilWbA3+DK/DjVbFaJBYheE= \
--file cp0.yaml
Using the fingerprint allows you to be sure you are sending the configuration to the correct machine, but is completely optional. After the configuration is applied to a node, it will reboot. Repeat this process for each of the nodes in your cluster.
Further details about talosctl, endpoints and nodes
Endpoints
When passed multiple endpoints, talosctl will automatically load balance requests to, and fail over between, all endpoints.
You can pass in --endpoints <IP Address1>,<IP Address2> as a comma-separated list of IP/DNS addresses to the current talosctl command.
You can also set the endpoints in your talosconfig, by calling talosctl config endpoint <IP Address1> <IP Address2>.
Note: these are space-separated, not comma-separated.
As an example, if the IP addresses of our control plane nodes are:
- 192.168.0.2
- 192.168.0.3
- 192.168.0.4
We would set those in the talosconfig with:
talosctl --talosconfig=./talosconfig \
config endpoint 192.168.0.2 192.168.0.3 192.168.0.4
Nodes
The node is the target you wish to perform the API call on.
It is possible to set a default set of nodes in the talosconfig file, but our recommendation is to explicitly pass in the node or nodes to be operated on with each talosctl command.
For a more in-depth discussion of Endpoints and Nodes, please see talosctl.
Default configuration file
You can reference which configuration file to use directly with the --talosconfig parameter:
talosctl --talosconfig=./talosconfig \
--nodes 192.168.0.2 version
However, talosctl comes with tooling to help you integrate and merge this configuration into the default talosctl configuration file.
This is done with the merge option.
talosctl config merge ./talosconfig
This will merge your new talosconfig into the default configuration file ($XDG_CONFIG_HOME/talos/config.yaml), creating it if necessary.
Like Kubernetes, the talosconfig configuration file has multiple “contexts” which correspond to multiple clusters.
The <cluster-name> you chose above will be used as the context name.
Kubernetes Bootstrap
Bootstrap your Kubernetes cluster by calling the bootstrap command against any of your control plane nodes (or the load balancer, if used for the Talos API endpoint):
talosctl bootstrap --nodes 192.168.0.2
The bootstrap operation should only be called ONCE and only on a SINGLE control plane node!
At this point, Talos will form an etcd cluster, generate all of the core Kubernetes assets, and start the Kubernetes control plane components.
After a few moments, you will be able to download your Kubernetes client configuration and get started:
talosctl kubeconfig
Running this command will add (merge) your new cluster into your local Kubernetes configuration.
If you would prefer the configuration to not be merged into your default Kubernetes configuration file, pass in a filename:
talosctl kubeconfig alternative-kubeconfig
You should now be able to connect to Kubernetes and see your nodes:
kubectl get nodes
And use talosctl to explore your cluster:
talosctl -n <NODEIP> dashboard
For a list of all the commands and operations that talosctl provides, see the CLI reference.
5 - System Requirements
Minimum Requirements
Role | Memory | Cores | System Disk |
---|---|---|---|
Control Plane | 2 GiB | 2 | 10 GiB |
Worker | 1 GiB | 1 | 10 GiB |
Recommended
Role | Memory | Cores | System Disk |
---|---|---|---|
Control Plane | 4 GiB | 4 | 100 GiB |
Worker | 2 GiB | 2 | 100 GiB |
These requirements are similar to those of Kubernetes.
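When provisioning many machines, the minimum requirements can be encoded as data and checked automatically. The following is a minimal sketch of that idea. The Spec class, the role keys, and the shortfalls function are hypothetical names; the numbers are copied from the Minimum Requirements table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    memory_gib: float
    cores: int
    disk_gib: float


# Values copied from the Minimum Requirements table.
MINIMUM = {
    "controlplane": Spec(memory_gib=2, cores=2, disk_gib=10),
    "worker": Spec(memory_gib=1, cores=1, disk_gib=10),
}


def shortfalls(role: str, machine: Spec) -> list[str]:
    """List the resources in which machine falls below the minimum for role."""
    minimum = MINIMUM[role]
    checks = [
        ("memory (GiB)", machine.memory_gib, minimum.memory_gib),
        ("cores", machine.cores, minimum.cores),
        ("system disk (GiB)", machine.disk_gib, minimum.disk_gib),
    ]
    return [f"{name}: have {have}, need {need}" for name, have, need in checks if have < need]
```

An empty list means the machine meets the minimum for its role. Swapping in the Recommended table values would check against the recommended sizing instead.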
Storage
Talos Linux itself requires less than 100 MB of disk space, but the EPHEMERAL partition is used to store pulled images, container work directories, and so on. Thus a minimum of 10 GiB of disk space is required, and 100 GiB is recommended. Note, however, that because Talos Linux assumes complete control of the disk it is installed on, so that it can control the partition table for image-based upgrades, you cannot partition the rest of the disk for use by workloads.
Thus it is recommended to install Talos Linux on a small, dedicated disk - using a terabyte-sized SSD for the Talos install disk would be wasteful. Sidero Labs recommends having separate disks (apart from the Talos install disk) to be used for storage.
6 - What's New in Talos 1.10.0
See also upgrade notes for important changes.
TBD
7 - Support Matrix
Talos Version | 1.10 | 1.9 |
---|---|---|
Release Date | 2025-04-15 (TBD) | 2024-12-17 (1.9.0) |
End of Community Support | 1.11.0 release (2025-08-15, TBD) | 1.10.0 release (2025-04-15, TBD) |
Enterprise Support | offered by Sidero Labs Inc. | offered by Sidero Labs Inc. |
Kubernetes | 1.33, 1.32, 1.31, 1.30, 1.29, 1.28 | 1.32, 1.31, 1.30, 1.29, 1.28, 1.27 |
NVIDIA Drivers | 550.x.x (PRODUCTION), 535.x.x (LTS) | 550.x.x (PRODUCTION), 535.x.x (LTS) |
Architecture | amd64, arm64 | amd64, arm64 |
Platforms | ||
- cloud | Akamai, AWS, GCP, Azure, CloudStack, Digital Ocean, Exoscale, Hetzner, OpenNebula, OpenStack, Oracle Cloud, Scaleway, Vultr, Upcloud | Akamai, AWS, GCP, Azure, CloudStack, Digital Ocean, Exoscale, Hetzner, OpenNebula, OpenStack, Oracle Cloud, Scaleway, Vultr, Upcloud |
- bare metal | x86: BIOS, UEFI, SecureBoot; arm64: UEFI, SecureBoot; boot: ISO, PXE, disk image | x86: BIOS, UEFI; arm64: UEFI; boot: ISO, PXE, disk image |
- virtualized | VMware, Hyper-V, KVM, Proxmox, Xen | VMware, Hyper-V, KVM, Proxmox, Xen |
- SBCs | Banana Pi M64, Jetson Nano, Libre Computer Board ALL-H3-CC, Nano Pi R4S, Pine64, Pine64 Rock64, Radxa ROCK Pi 4c, Radxa Rock4c+, Raspberry Pi 4B, Raspberry Pi Compute Module 4, Turing RK1 | Banana Pi M64, Jetson Nano, Libre Computer Board ALL-H3-CC, Nano Pi R4S, Orange Pi R1 Plus LTS, Pine64, Pine64 Rock64, Radxa ROCK Pi 4c, Raspberry Pi 4B, Raspberry Pi Compute Module 4, Turing RK1 |
- local | Docker, QEMU | Docker, QEMU |
Omni | ||
Omni | >= 0.45.0 | >= 0.45.0 |
Cluster API | ||
CAPI Bootstrap Provider Talos | >= 0.6.7 | >= 0.6.7 |
CAPI Control Plane Provider Talos | >= 0.5.8 | >= 0.5.8 |
Sidero | >= 0.6.5 | >= 0.6.5 |
Platform Tiers
- Tier 1: Automated tests, high-priority fixes.
- Tier 2: Tested from time to time, medium-priority bugfixes.
- Tier 3: Not tested by core Talos team, community tested.
Tier 1
- Metal
- AWS
- Azure
- GCP
Tier 2
- Digital Ocean
- OpenStack
- VMware
Tier 3
- Akamai
- CloudStack
- Exoscale
- Hetzner
- nocloud
- OpenNebula
- Oracle Cloud
- Scaleway
- Vultr
- Upcloud
8 - Troubleshooting
In this guide we assume that Talos is configured with default features enabled, such as Discovery Service and KubePrism. If these features are disabled, some of the troubleshooting steps may not apply or may need to be adjusted.
This guide is structured so that it can be followed step-by-step; skip sections which are not relevant to your issue.
Network Configuration
As Talos Linux is an API-based operating system, it is important to have networking configured so that the API can be accessed. Some information can be gathered from the Interactive Dashboard which is available on the machine console.
When running in the cloud, networking should be configured automatically.
When running on bare metal, it may need more specific configuration; see the networking metal configuration guide.
Talos API
The Talos API runs on port 50000. Control plane nodes should always serve the Talos API, while worker nodes require access to the control plane nodes to issue TLS certificates for the workers.
Firewall Issues
Make sure that the firewall is not blocking port 50000, and communication on ports 50000/50001 inside the cluster.
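As a first check, it can help to confirm that a plain TCP connection to the port succeeds from the machine running talosctl. The sketch below is a hypothetical helper (check_tcp_port is not part of Talos) and assumes bash with /dev/tcp support and the coreutils timeout command. A successful connection only shows that the port is reachable, not that the API accepts your client certificate.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try to open a TCP connection to HOST:PORT,
# giving up after 3 seconds. Uses bash's /dev/tcp redirection.
check_tcp_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "open: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}
```

For example, `check_tcp_port 10.5.0.2 50000` probes the Talos API port on a node at 10.5.0.2 (a placeholder address).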
Client Configuration Issues
Make sure to use the correct talosconfig client configuration file matching your cluster.
See getting started for more information.
The most common issue is that talosctl gen config writes talosconfig to a file in the current directory, while talosctl by default picks up the configuration from the default location (~/.talos/config).
The path to the configuration file can be specified with the --talosconfig flag to talosctl.
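A minimal sketch of pointing talosctl at the freshly generated file rather than the default location. The wrapper name talosctl_with_local_config is hypothetical; only the --talosconfig flag comes from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run talosctl against the talosconfig found in
# the given directory instead of the default ~/.talos/config.
talosctl_with_local_config() {
  local dir=$1
  shift
  if [ ! -f "$dir/talosconfig" ]; then
    echo "no talosconfig found in $dir" >&2
    return 1
  fi
  talosctl --talosconfig "$dir/talosconfig" "$@"
}
```

For example, `talosctl_with_local_config . -n <IP> version` uses the talosconfig in the current directory.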
Conflict on Kubernetes and Host Subnets
If talosctl returns an error saying that certificate IPs are empty, it might be due to a conflict between Kubernetes and host subnets.
The Talos API runs on the host network, but it automatically excludes Kubernetes pod and service subnets from the usable set of addresses.
The Talos default machine configuration specifies the following Kubernetes pod and service IPv4 CIDRs: 10.244.0.0/16 and 10.96.0.0/12.
If the host network is configured with one of these subnets, change the machine configuration to use a different subnet.
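The overlap check can be done by hand, but a small script makes it less error-prone. The sketch below is illustrative only: the function names are hypothetical, it handles IPv4 only, and it hardcodes the two default CIDRs mentioned above (adjust them if your machine configuration uses other subnets).

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit 0) when two IPv4 CIDRs share at least one address.
# Two CIDR blocks overlap exactly when they agree on the shorter prefix.
cidr_overlap() {
  local a_ip=${1%/*} a_len=${1#*/} b_ip=${2%/*} b_len=${2#*/}
  local len=$(( a_len < b_len ? a_len : b_len ))
  local mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$a_ip") & mask )) -eq $(( $(ip_to_int "$b_ip") & mask )) ]
}

# Report which default Kubernetes subnets a host CIDR collides with.
check_host_subnet() {
  local host_cidr=$1 k8s_cidr status=0
  for k8s_cidr in 10.244.0.0/16 10.96.0.0/12; do
    if cidr_overlap "$host_cidr" "$k8s_cidr"; then
      echo "conflict: $host_cidr overlaps $k8s_cidr"
      status=1
    fi
  done
  return "$status"
}
```

For example, `check_host_subnet 10.100.0.0/24` reports a conflict with 10.96.0.0/12, while `check_host_subnet 192.168.1.0/24` prints nothing and succeeds.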
Wrong Endpoints
The talosctl CLI connects to the Talos API via the specified endpoints, which should be a list of control plane machine addresses.
The client will automatically retry on other endpoints if some endpoints are unavailable.
Worker nodes should not be used as endpoints, as they are not able to forward requests to other nodes.
The VIP should never be used as a Talos API endpoint.
TCP Loadbalancer
When using a TCP loadbalancer, make sure the loadbalancer endpoint is included in the .machine.certSANs list in the machine configuration.
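One way to add the entry is with a small machine configuration patch. The sketch below only writes the patch file; the helper name, the file name, and lb.example.com are placeholders, not values from this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a machine configuration patch that adds
# one extra SAN (for example the load balancer address) to
# .machine.certSANs.
write_certsans_patch() {
  local san=$1 out_file=$2
  cat > "$out_file" <<EOF
machine:
  certSANs:
    - $san
EOF
}
```

After `write_certsans_patch lb.example.com certsans-patch.yaml`, the patch could be applied with `talosctl -n <IP> patch machineconfig --patch @certsans-patch.yaml`; review the resulting configuration before relying on it.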
System Requirements
If minimum system requirements are not met, this might manifest itself in various ways, such as random failures when starting services, or failures to pull images from the container registry.
Running Health Checks
Talos Linux provides a set of basic health checks with the talosctl health command, which can be used to check the health of the cluster.
In the default mode, talosctl health uses information from the discovery to get the information about cluster members.
This can be overridden with the command line flags --control-plane-nodes and --worker-nodes.
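A sketch of running the health checks with an explicit node list, for cases where discovery data is missing or suspect. The wrapper name is hypothetical, and passing several addresses as one comma-separated value is an assumption about how the two flags parse lists.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the health checks against an explicit node
# list instead of relying on discovery.
# Arguments: node to query, control plane IPs, worker IPs
# (the IP lists are assumed to be comma-separated).
health_with_explicit_nodes() {
  local node=$1 control_planes=$2 workers=$3
  talosctl -n "$node" health \
    --control-plane-nodes "$control_planes" \
    --worker-nodes "$workers"
}
```

For example: `health_with_explicit_nodes 10.5.0.2 10.5.0.2 10.5.0.3` (placeholder addresses).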
Gathering Logs
While the logs and state of the system can be queried via the Talos API, it is often useful to gather the logs from all nodes in the cluster, and analyze them offline.
The talosctl support command can be used to gather logs and other information from the nodes specified with the --nodes flag (multiple nodes are supported).
Discovery and Cluster Membership
Talos Linux uses Discovery Service to discover other nodes in the cluster.
The list of members on each machine should be consistent: talosctl -n <IP> get members.
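Comparing the lists by eye gets tedious with more than a few nodes. The sketch below compares the member IDs reported by each node against the first one. It assumes the default table output of talosctl get, with ID as the fourth column (as in the other resource listings in this guide); the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the sorted member IDs as seen by one node.
# Assumes the default table output, where ID is the fourth column.
member_ids() {
  talosctl -n "$1" get members | awk 'NR > 1 { print $4 }' | sort
}

# Compare the member list of every given node against the first one.
compare_members() {
  local first=$1 node reference status=0
  reference=$(member_ids "$first")
  for node in "$@"; do
    if [ "$(member_ids "$node")" != "$reference" ]; then
      echo "member list on $node differs from $first"
      status=1
    fi
  done
  return "$status"
}
```

For example: `compare_members 10.5.0.2 10.5.0.3` (placeholder addresses) prints nothing when both nodes agree.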
Some Members are Missing
Ensure connectivity to the discovery service (default is discovery.talos.dev:443), and that the discovery registry is not disabled.
Duplicate Members
Don’t use the same base secrets to generate machine configuration for multiple clusters, as some secrets are used to identify members of the same cluster. If the same machine configuration (or the same secrets) is used to repeatedly create and destroy clusters, the discovery service will see the same nodes as members of different clusters.
Removed Members are Still Present
Talos Linux removes itself from the discovery service when it is reset. If the machine was not reset, it might show up as a member of the cluster for the maximum TTL of the discovery service (30 minutes), and after that it will be automatically removed.
etcd Issues
etcd is the distributed key-value store used by Kubernetes to store its state.
Talos Linux provides automation to manage etcd members running on control plane nodes.
If etcd is not healthy, the Kubernetes API server will not be able to function correctly.
It is always recommended to run an odd number of etcd members: with 3 or more members, the cluster stays available as long as a quorum (a majority of members) is healthy, so it tolerates the failure of a minority of members.
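The arithmetic behind the odd-number recommendation is short enough to sketch. Quorum is a strict majority of members, and the number of tolerated failures is whatever is left over; the function name below is only illustrative.

```shell
#!/usr/bin/env bash
# Print the quorum size and the number of member failures that can be
# tolerated for a given etcd cluster size.
etcd_fault_tolerance() {
  local members=$1
  local quorum=$(( members / 2 + 1 ))
  echo "members=$members quorum=$quorum tolerated_failures=$(( members - quorum ))"
}
```

`etcd_fault_tolerance 3` and `etcd_fault_tolerance 4` both report one tolerated failure, which is why adding a fourth member does not improve fault tolerance over three.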
Common troubleshooting steps:
- check etcd service state with talosctl -n IP service etcd for each control plane node
- check etcd membership on each control plane node with talosctl -n IP etcd members
- check etcd logs with talosctl -n IP logs etcd
- check etcd alarms with talosctl -n IP etcd alarm list
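The steps above can be run in one pass over all control plane nodes. This is a sketch only: the function name is hypothetical, the membership query uses the talosctl etcd members subcommand, and trimming the logs to the last 20 lines is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the basic etcd checks against each control
# plane node given as an argument.
etcd_checks() {
  local node
  for node in "$@"; do
    echo "== $node =="
    talosctl -n "$node" service etcd
    talosctl -n "$node" etcd members
    talosctl -n "$node" etcd alarm list
    # Only the most recent log lines, to keep the output readable.
    talosctl -n "$node" logs etcd | tail -n 20
  done
}
```

For example: `etcd_checks 10.5.0.2` (a placeholder control plane address).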
All etcd Services are Stuck in Pre State
Make sure that a single member was bootstrapped.
Check that the machine is able to pull the etcd container image; check talosctl dmesg for messages starting with the retrying: prefix.
Some etcd Services are Stuck in Pre State
Make sure traffic is not blocked on port 2380 between controlplane nodes.
Check that etcd quorum is not lost.
Check that all control plane nodes are reported in talosctl get members output.
etcd Reports an Alarm
See the etcd maintenance guide.
etcd Quorum is Lost
See the disaster recovery guide.
Other Issues
etcd will only run on control plane nodes.
If a node is designated as a worker node, you should not expect etcd to be running on it.
When a node boots for the first time, the etcd data directory (/var/lib/etcd) is empty, and it will only be populated when etcd is launched.
If the etcd service is crashing and restarting, check its logs with talosctl -n <IP> logs etcd.
The most common reasons for crashes are:
- wrong arguments passed via extraArgs in the configuration;
- booting Talos on a non-empty disk with an existing Talos installation, where /var/lib/etcd contains data from the old cluster.
kubelet and Kubernetes Node Issues
The kubelet service should be running on all Talos nodes, and it is responsible for running Kubernetes pods, static pods (including control plane components), and registering the node with the Kubernetes API server.
If the kubelet doesn’t run on a control plane node, it will block the control plane components from starting.
The node will not be registered in Kubernetes until the Kubernetes API server is up and initial Kubernetes manifests are applied.
kubelet is not running
Check that the kubelet image is available (talosctl image ls --namespace system).
Check kubelet logs with talosctl -n IP logs kubelet for startup errors:
- make sure the Kubernetes version is supported by this Talos release
- make sure kubelet extra arguments and extra configuration supplied with the Talos machine configuration are valid
Talos Complains about Node Not Found
kubelet hasn’t yet registered the node with the Kubernetes API server. This is expected during initial cluster bootstrap, and the error will go away.
If the message persists, check Kubernetes API health.
The Kubernetes controller manager (kube-controller-manager) is responsible for monitoring the certificate signing requests (CSRs) and issuing certificates for each of them.
The kubelet is responsible for generating and submitting the CSRs for its associated node.
The state of any CSRs can be checked with kubectl get csr:
$ kubectl get csr
NAME AGE SIGNERNAME REQUESTOR CONDITION
csr-jcn9j 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-p6b9q 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-sw6rm 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
csr-vlghg 14m kubernetes.io/kube-apiserver-client-kubelet system:bootstrap:q9pyzr Approved,Issued
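When there are many requests, it can be quicker to filter for the ones that have not been approved yet. The sketch below assumes the column layout shown above, with CONDITION as the last column; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the names of CSRs whose CONDITION column
# (the last one) is still Pending.
pending_csrs() {
  kubectl get csr --no-headers | awk '$NF == "Pending" { print $1 }'
}
```

A request that stays pending can be inspected with `kubectl describe csr <name>` and, if it is one you expect, approved with `kubectl certificate approve <name>`.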
kubectl get nodes Reports Wrong Internal IP
Configure the correct internal IP address with .machine.kubelet.nodeIP.
kubectl get nodes Reports Wrong External IP
Talos Linux doesn’t manage the external IP; it is managed by the Kubernetes Cloud Controller Manager.
kubectl get nodes Reports Wrong Node Name
By default, the Kubernetes node name is derived from the hostname. Update the hostname using the machine configuration, cloud configuration, or via a DHCP server.
Node Is Not Ready
A Node in Kubernetes is marked as Ready only once its CNI is up.
It takes a minute or two for the CNI images to be pulled and for the CNI to start.
If the node is stuck in this state for too long, check CNI pods and logs with kubectl.
Usually, CNI-related resources are created in the kube-system namespace.
For example, for the default Talos Flannel CNI:
$ kubectl -n kube-system get pods
NAME READY STATUS RESTARTS AGE
...
kube-flannel-25drx 1/1 Running 0 23m
kube-flannel-8lmb6 1/1 Running 0 23m
kube-flannel-gl7nx 1/1 Running 0 23m
kube-flannel-jknt9 1/1 Running 0 23m
...
Duplicate/Stale Nodes
Talos Linux doesn’t remove Kubernetes nodes automatically, so if a node is removed from the cluster, it will still be present in Kubernetes.
Remove the node from Kubernetes with kubectl delete node <node-name>.
Talos Complains about Certificate Errors on kubelet API
This error might appear during initial cluster bootstrap, and it will go away once the Kubernetes API server is up and the node is registered.
Example Talos logs:
[talos] controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: Get \"https://127.0.0.1:10250/pods/?timeout=30s\": remote error: tls: internal error"}
With the default configuration, kubelet issues a self-signed server certificate, but when the rotate-server-certificates feature is enabled, kubelet obtains its certificate via kube-apiserver.
Make sure the kubelet CSR is approved by the Kubernetes API server.
In either case, this error is not critical, as it only affects reporting of the pod status to Talos Linux.
Kubernetes Control Plane
The Kubernetes control plane consists of the following components:
- kube-apiserver - the Kubernetes API server
- kube-controller-manager - the Kubernetes controller manager
- kube-scheduler - the Kubernetes scheduler
Optionally, kube-proxy runs as a DaemonSet to provide pod-to-service communication.
coredns provides name resolution for the cluster.
CNI is not part of the control plane, but it is required for Kubernetes pods using pod networking.
Troubleshooting should always start with kube-apiserver, and then proceed to other components.
Talos Linux configures kube-apiserver to talk to the etcd running on the same node, so etcd must be healthy before kube-apiserver can start.
The kube-controller-manager and kube-scheduler are configured to talk to the kube-apiserver on the same node, so they will not start until kube-apiserver is healthy.
Control Plane Static Pods
Talos should generate the static pod definitions for the Kubernetes control plane as resources:
$ talosctl -n <IP> get staticpods
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 k8s StaticPod kube-apiserver 1
172.20.0.2 k8s StaticPod kube-controller-manager 1
172.20.0.2 k8s StaticPod kube-scheduler 1
Talos should report that the static pod definitions are rendered for the kubelet:
$ talosctl -n <IP> dmesg | grep 'rendered new'
172.20.0.2: user: warning: [2023-04-26T19:17:52.550527204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-apiserver"}
172.20.0.2: user: warning: [2023-04-26T19:17:52.552186204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-controller-manager"}
172.20.0.2: user: warning: [2023-04-26T19:17:52.554607204Z]: [talos] rendered new static pod {"component": "controller-runtime", "controller": "k8s.StaticPodServerController", "id": "kube-scheduler"}
If the static pod definitions are not rendered, check etcd and kubelet service health (see above) and the controller runtime logs (talosctl logs controller-runtime).
Control Plane Pod Status
Initially the kube-apiserver component will not be running, and it takes some time before it becomes fully up during bootstrap (the image has to be pulled from the Internet, etc.).
The status of the control plane components on each of the control plane nodes can be checked with talosctl containers -k:
$ talosctl -n <IP> containers --kubernetes
NODE NAMESPACE ID IMAGE PID STATUS
172.20.0.2 k8s.io kube-system/kube-apiserver-talos-default-controlplane-1 registry.k8s.io/pause:3.2 2539 SANDBOX_READY
172.20.0.2 k8s.io └─ kube-system/kube-apiserver-talos-default-controlplane-1:kube-apiserver:51c3aad7a271 registry.k8s.io/kube-apiserver:v1.33.0 2572 CONTAINER_RUNNING
The logs of the control plane components can be checked with talosctl logs --kubernetes (or with -k as a shorthand):
talosctl -n <IP> logs -k kube-system/kube-apiserver-talos-default-controlplane-1:kube-apiserver:51c3aad7a271
If a control plane component reports an error on startup:
- make sure the Kubernetes version is supported by this Talos release
- make sure extra arguments and extra configuration supplied with the Talos machine configuration are valid
Kubernetes Bootstrap Manifests
As part of the bootstrap process, Talos injects bootstrap manifests into the Kubernetes API server. There are two kinds of these manifests: system manifests built into Talos, and extra manifests that are downloaded (custom CNI, extra manifests in the machine config):
$ talosctl -n <IP> get manifests
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 controlplane Manifest 00-kubelet-bootstrapping-token 1
172.20.0.2 controlplane Manifest 01-csr-approver-role-binding 1
172.20.0.2 controlplane Manifest 01-csr-node-bootstrap 1
172.20.0.2 controlplane Manifest 01-csr-renewal-role-binding 1
172.20.0.2 controlplane Manifest 02-kube-system-sa-role-binding 1
172.20.0.2 controlplane Manifest 03-default-pod-security-policy 1
172.20.0.2 controlplane Manifest 05-https://docs.projectcalico.org/manifests/calico.yaml 1
172.20.0.2 controlplane Manifest 10-kube-proxy 1
172.20.0.2 controlplane Manifest 11-core-dns 1
172.20.0.2 controlplane Manifest 11-core-dns-svc 1
172.20.0.2 controlplane Manifest 11-kube-config-in-cluster 1
Details of each manifest can be queried by adding -o yaml:
$ talosctl -n <IP> get manifests 01-csr-approver-role-binding --namespace=controlplane -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: Manifests.kubernetes.talos.dev
    id: 01-csr-approver-role-binding
    version: 1
    phase: running
spec:
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: system-bootstrap-approve-node-client-csr
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
      subjects:
        - apiGroup: rbac.authorization.k8s.io
          kind: Group
          name: system:bootstrappers
Other Control Plane Components
Once the Kubernetes API server is up, issues with other control plane components can be investigated with kubectl:
kubectl get nodes -o wide
kubectl get pods -o wide --all-namespaces
kubectl describe pod -n NAMESPACE POD
kubectl logs -n NAMESPACE POD
Kubernetes API
The Kubernetes API client configuration (kubeconfig) can be retrieved using the Talos API with the talosctl -n <IP> kubeconfig command.
Talos Linux mostly doesn’t depend on the Kubernetes API endpoint for the cluster, but the Kubernetes API endpoint should be configured correctly for external access to the cluster.
Kubernetes Control Plane Endpoint
The Kubernetes control plane endpoint is the single canonical URL by which the Kubernetes API is accessed.
Especially with high-availability (HA) control planes, this endpoint may point to a load balancer or a DNS name which may have multiple A and AAAA records.
Like Talos’ own API, the Kubernetes API uses mutual TLS, client certs, and a common Certificate Authority (CA).
Unlike general-purpose websites, there is no need for an upstream CA, so tools such as cert-manager, Let’s Encrypt, or products such as validated TLS certificates are not required.
Encryption, however, is required, and hence the URL scheme will always be https://.
By default, the Kubernetes API server in Talos runs on port 6443.
As such, the control plane endpoint URLs for Talos will almost always be of the form https://endpoint:6443.
(The port is required, since it is not the https default of 443.)
The endpoint above may be a DNS name or IP address, but it should be directed to the set of all controlplane nodes, as opposed to a single one.
As mentioned above, this can be achieved by a number of strategies, including:
- an external load balancer
- DNS records
- Talos-builtin shared IP (VIP)
- BGP peering of a shared IP (such as with kube-vip)
Using a DNS name here is a good idea, since it allows any other option, while offering a layer of abstraction. It allows the underlying IP addresses to change without impacting the canonical URL.
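The URL form described above can be sketched as a small helper. The function name and kube.example.com are placeholders; the helper handles DNS names and IPv4 addresses only (an IPv6 literal would additionally need square brackets).

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the control plane endpoint URL from a DNS
# name or IPv4 address. The port defaults to 6443 and is always spelled
# out, because it differs from the https default of 443.
control_plane_url() {
  local endpoint=$1 port=${2:-6443}
  echo "https://$endpoint:$port"
}
```

For example, `control_plane_url kube.example.com` prints https://kube.example.com:6443, which is the kind of value passed as the cluster endpoint to talosctl gen config.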
Unlike most services in Kubernetes, the API server runs with host networking, meaning that it shares the network namespace with the host. This means you can use the IP address(es) of the host to refer to the Kubernetes API server.
For availability of the API, it is important that any load balancer be aware of the health of the backend API servers, to minimize disruptions during common node operations like reboots and upgrades.
Miscellaneous
Checking Controller Runtime Logs
Talos runs a set of controllers which operate on resources to build and support machine operations.
Some debugging information can be queried from the controller logs with talosctl logs controller-runtime:
talosctl -n <IP> logs controller-runtime
Controllers continuously run a reconcile loop, so at any time, they may be starting, failing, or restarting. This is expected behavior.
If there are no new messages in the controller-runtime log, it means that the controllers have successfully finished reconciling, and that the current system state is the desired system state.