1 - Philosophy
Distributed
Talos is intended to be operated in a distributed manner: it is built for a high-availability dataplane first.
Its `etcd` cluster is built in an ad-hoc manner, with each appointed node joining on its own directive (with proper security validations enforced, of course).
Like Kubernetes, workloads are intended to be distributed across any number of compute nodes.
There should be no single points of failure, and the level of required coordination is as low as each platform allows.
Immutable
Talos takes immutability very seriously. Talos itself, even when installed on a disk, always runs from a SquashFS image, meaning that even if a directory is mounted to be writable, the image itself is never modified. All images are signed and delivered as single, versioned files. We can always run integrity checks on our image to verify that it has not been modified.
While Talos does allow a few, highly-controlled write points to the filesystem, we strive to make them as non-unique and non-critical as possible. We call the writable partition the “ephemeral” partition precisely because we want to make sure none of us ever uses it for unique, non-replicated, non-recreatable data. Thus, if all else fails, we can always wipe the disk and get back up and running.
Minimal
We are always trying to reduce Talos’ footprint. Because nearly the entire OS is built from scratch in Go, we are in a good position. We have no shell. We have no SSH. We have none of the GNU utilities, not even a rollup tool such as busybox. Everything in Talos is there because it is necessary, and nothing is included which isn’t.
As a result, the OS right now produces a SquashFS image size of less than 80 MB.
Ephemeral
Everything Talos writes to its disk is either replicated or reconstructable. Since the controlplane is highly available, the loss of any node will cause neither service disruption nor loss of data. No writes are even allowed to the vast majority of the filesystem. We even call the writable partition “ephemeral” to keep this idea always in focus.
Secure
Talos has always been designed with security in mind. With its immutability, its minimalism, its signing, and its component-based design, we are able to simply bypass huge classes of vulnerabilities. Moreover, because of the way we have designed Talos, we are able to take advantage of a number of additional hardening measures, such as the recommendations of the Kernel Self Protection Project (KSPP) and completely disabling dynamic kernel modules.
There are no passwords in Talos. All networked communication is encrypted and key-authenticated. Talos certificates are short-lived and automatically rotated. Kubernetes is always constructed with its own separate, enforced PKI structure.
Declarative
Everything which can be configured in Talos is done through a single YAML manifest. There is no scripting and no procedural steps. Everything is defined by the one declarative YAML file. This configuration includes that of both Talos itself and the Kubernetes which it forms.
This is achievable because Talos is tightly focused to do one thing: run Kubernetes, in the easiest, most secure, most reliable way it can.
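As an abbreviated sketch of what that single manifest looks like (the values below are placeholders and most fields are omitted), a Talos machine configuration has roughly this shape:
version: v1alpha1
machine:
  type: controlplane        # or "worker"
  network:
    hostname: example-node  # placeholder
cluster:
  clusterName: example-cluster
  controlPlane:
    endpoint: https://cluster.example.com:6443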
Not based on X distro
Talos Linux isn’t based on any other distribution. We think of ourselves as the second generation of container-optimised operating systems, where things like CoreOS, Flatcar, and Rancher represent the first generation (but the technology is not derived from any of those).
Talos Linux is actually a ground-up rewrite of the userspace, from PID 1.
We run the Linux kernel, but everything downstream of that is our own custom
code, written in Go, rigorously-tested, and published as an immutable,
integrated image.
The Linux kernel launches what we call `machined`, for instance, not `systemd`. There is no `systemd` on our system. There are no GNU utilities, no shell, no SSH, no packages, nothing you could associate with any other distribution.
An Operating System designed for Kubernetes
Technically, Talos Linux installs to a computer like any other operating system. Unlike other operating systems, Talos is not meant to run alone, on a single machine. A design goal of Talos Linux is eliminating the management of individual nodes as much as possible. In order to do that, Talos Linux operates as a cluster of machines, with lots of checking and coordination between them, at all levels.
There is only a cluster. Talos is meant to do one thing: maintain a Kubernetes cluster, and it does this very, very well.
The entirety of the configuration of any machine is specified by a single configuration file, which can often be the same configuration file used across many machines. Much like a biological system, if some component misbehaves, just cut it out and let a replacement grow. Rebuilds of Talos are remarkably fast, whether they be new machines, upgrades, or reinstalls. Never get hung up on an individual machine.
2 - Architecture
Talos is designed to be atomic in deployment and modular in composition.
It is atomic in that the entirety of Talos is distributed as a single, self-contained image, which is versioned, signed, and immutable.
It is modular in that it is composed of many separate components which have clearly defined gRPC interfaces which facilitate internal flexibility and external operational guarantees.
All of the main Talos components communicate with each other by gRPC, through a socket on the local machine. This imposes a clear separation of concerns and ensures that changes over time which affect the interoperation of components are a part of the public git record. The benefit is that each component may be iterated and changed as its needs dictate, so long as the external API is controlled. This is a key component in reducing coupling and maintaining modularity.
File system partitions
Talos uses these partitions with the following labels:
- EFI - stores EFI boot data.
- BIOS - used for GRUB’s second stage boot.
- BOOT - used for the boot loader, stores initramfs and kernel data.
- META - stores metadata about the Talos node, such as node IDs.
- STATE - stores machine configuration, node identity data for cluster discovery, and KubeSpan information.
- EPHEMERAL - stores ephemeral state information, mounted at `/var`.
The File System
One of the unique design decisions in Talos is the layout of the root file system. There are three “layers” to the Talos root file system. At its core the rootfs is a read-only squashfs. The squashfs is then mounted as a loop device into memory. This provides Talos with an immutable base.
The next layer is a set of `tmpfs` file systems for runtime-specific needs. Aside from the standard pseudo file systems such as `/dev`, `/proc`, `/run`, `/sys` and `/tmp`, a special `/system` is created for internal needs. One reason for this is that we need special files such as `/etc/hosts` and `/etc/resolv.conf` to be writable (remember that the rootfs is read-only). For example, at boot Talos will write `/system/etc/hosts` and then bind mount it over `/etc/hosts`. This means that instead of making all of `/etc` writable, Talos only makes very specific files writable under `/etc`. All files under `/system` are completely recreated on each boot.
For files and directories that need to persist across boots, Talos creates `overlayfs` file systems. `/etc/kubernetes` is a good example of this. Directories like this are `overlayfs` backed by an XFS file system mounted at `/var`.
The `/var` directory is owned by Kubernetes with the exception of the above `overlayfs` file systems. This directory is writable and used by `etcd` (in the case of control plane nodes), the kubelet, and the CRI (containerd). Its content survives machine reboots, but it is wiped and lost on machine upgrades and resets, unless the `--preserve` option of `talosctl upgrade` or the `--system-labels-to-wipe` option of `talosctl reset` is used.
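For illustration, a sketch of how the two options mentioned above are used, assuming a node at the placeholder address 10.0.0.2 (the installer image reference and version are placeholders as well):
# keep the contents of EPHEMERAL (/var) across an upgrade
talosctl -n 10.0.0.2 upgrade --preserve --image ghcr.io/siderolabs/installer:v1.7.0
# wipe only the EPHEMERAL partition during a reset, leaving other partitions intact
talosctl -n 10.0.0.2 reset --system-labels-to-wipe EPHEMERAL --reboot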
3 - Components
In this section, we discuss the various components that underpin Talos.
Components
Talos Linux and Kubernetes are tightly integrated.
In the following, the focus is on the Talos Linux-specific components.
| Component | Description |
| --- | --- |
| apid | When interacting with Talos, the gRPC API endpoint you interact with directly is provided by `apid`. `apid` acts as the gateway for all component interactions and forwards the requests to `machined`. |
| containerd | An industry-standard container runtime with an emphasis on simplicity, robustness, and portability. To learn more, see the containerd website. |
| machined | Talos replacement for the traditional Linux init-process. Specially designed to run Kubernetes and does not allow starting arbitrary user services. |
| kernel | The Linux kernel included with Talos is configured according to the recommendations outlined in the Kernel Self Protection Project. |
| trustd | To run and operate a Kubernetes cluster, a certain level of trust is required. Based on the concept of a ‘Root of Trust’, `trustd` is a simple daemon responsible for establishing trust within the system. |
| udevd | Implementation of `eudev` into `machined`. `eudev` is Gentoo’s fork of udev, systemd’s device file manager for the Linux kernel. It manages device nodes in `/dev` and handles all user space actions when adding or removing devices. To learn more, see the Gentoo Wiki. |
apid
When interacting with Talos, the gRPC API endpoint you will interact with directly is `apid`. `apid` acts as the gateway for all component interactions, and provides a mechanism to route requests to the appropriate destination when running on a control plane node.
We’ll use some examples below to illustrate what `apid` is doing.
When a user wants to interact with a Talos component via `talosctl`, there are two flags that control the interaction with `apid`. The `-e | --endpoints` flag specifies which Talos node (via `apid`) should handle the connection; typically this is a public-facing server. The `-n | --nodes` flag specifies which Talos node(s) should respond to the request. If `--nodes` is omitted, the first endpoint will be used.
Note: Typically, there will be an `endpoint` already defined in the Talos config file. Optionally, `nodes` can be included here as well.
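For illustration, a client configuration (`talosconfig`) defining both might look roughly like this (context name, endpoint, and node are placeholders; certificate fields elided):
context: my-cluster
contexts:
  my-cluster:
    endpoints:
      - cluster.talos.dev
    nodes:
      - node02
    ca: ...
    crt: ...
    key: ...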
For example, if a user wants to interact with `machined`, a command like `talosctl -e cluster.talos.dev memory` may be used.
$ talosctl -e cluster.talos.dev memory
NODE TOTAL USED FREE SHARED BUFFERS CACHE AVAILABLE
cluster.talos.dev 7938 1768 2390 145 53 3724 6571
In this case, `talosctl` is interacting with `apid` running on `cluster.talos.dev` and forwarding the request to the `machined` API.
If we wanted to extend our example to retrieve `memory` from another node in our cluster, we could use the command `talosctl -e cluster.talos.dev -n node02 memory`.
$ talosctl -e cluster.talos.dev -n node02 memory
NODE TOTAL USED FREE SHARED BUFFERS CACHE AVAILABLE
node02 7938 1768 2390 145 53 3724 6571
The `apid` instance on `cluster.talos.dev` receives the request and forwards it to `apid` running on `node02`, which forwards the request to the `machined` API.
We can further extend our example to retrieve `memory` for all nodes in our cluster by appending additional `-n node` flags or using a comma-separated list of nodes (`-n node01,node02,node03`):
$ talosctl -e cluster.talos.dev -n node01 -n node02 -n node03 memory
NODE TOTAL USED FREE SHARED BUFFERS CACHE AVAILABLE
node01 7938 871 4071 137 49 2945 7042
node02 257844 14408 190796 18138 49 52589 227492
node03 257844 1830 255186 125 49 777 254556
The `apid` instance on `cluster.talos.dev` receives the request and forwards it to `node01`, `node02`, and `node03`, which then forward the request to their local `machined` API.
containerd
Containerd provides the container runtime to launch workloads on Talos and Kubernetes.
Talos services are namespaced under the `system` namespace in containerd, whereas the Kubernetes services are namespaced under the `k8s.io` namespace.
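These namespaces can be inspected from the client, for example (the node address is a placeholder):
# list containers in the system namespace
talosctl -n 172.20.0.2 containers
# list containers in the k8s.io namespace
talosctl -n 172.20.0.2 containers -k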
machined
A common theme throughout the design of Talos is minimalism.
We believe strongly in the UNIX philosophy that each program should do one job well.
The `init` included in Talos is one example of this, and we are calling it `machined`.
We wanted to create a focused `init` that had one job: run Kubernetes. To that extent, `machined` is relatively static in that it does not allow for arbitrary user-defined services. Only the services necessary to run Kubernetes and manage the node are available. This includes:
- containerd
- etcd
- kubelet
- networkd
- trustd
- udevd
The `machined` process handles all machine configuration, API handling, and resource and controller management.
kernel
The Linux kernel included with Talos is configured according to the recommendations outlined in the Kernel Self Protection Project (KSPP).
trustd
Security is one of the highest priorities within Talos. To run and operate a Kubernetes cluster, a certain level of trust is required. For example, orchestrating the bootstrap of a highly available control plane requires sensitive PKI data distribution.
To that end, we created `trustd`. Based on a Root of Trust concept, `trustd` is a simple daemon responsible for establishing trust within the system. Once trust is established, various methods become available to the trustee. For example, it can accept a write request from another node to place a file on disk.
Additional methods and capabilities will be added to the `trustd` component to support new functionality in the rest of the Talos environment.
udevd
Udevd handles the kernel device notifications and sets up the necessary links in `/dev`.
4 - Control Plane
This guide provides information about the Kubernetes control plane, and details on how Talos runs and bootstraps the Kubernetes control plane.
What is a control plane node?
A control plane node is a node which:
- runs etcd, the Kubernetes database
- runs the Kubernetes control plane:
  - kube-apiserver
  - kube-controller-manager
  - kube-scheduler
- serves as an administrative proxy to the worker nodes
These nodes are critical to the operation of your cluster. Without control plane nodes, Kubernetes will not respond to changes in the system, and certain central services may not be available.
Talos nodes which have a `.machine.type` of `controlplane` are control plane nodes (check via `talosctl get member`).
Control plane nodes are tainted by default to prevent workloads from being scheduled onto them. This both protects the control plane from workloads consuming resources and starving the control plane processes, and reduces the risk that a vulnerability exposes the control plane’s credentials to a workload.
The Control Plane and Etcd
A critical design concept of Kubernetes (and Talos) is the `etcd` database.
Properly managed (which Talos Linux does), `etcd` should never have split brain or noticeable downtime. In order to do this, `etcd` maintains the concept of “membership” and of “quorum”. To perform any operation, read or write, the database requires quorum. That is, a majority of members must agree on the current leader, and members that are down or unreachable count against that majority. For example, if there are three members, at least two out of the three must agree on the current leader. If two disagree or fail to answer, the `etcd` database will lock itself until quorum is achieved in order to protect the integrity of the data.
This design means that having two control plane nodes is worse than having only one, because if either goes down, your database will lock (and the chance of one of two nodes going down is greater than the chance of a single node going down). Similarly, a four-node etcd cluster is worse than a three-node etcd cluster: a four-node cluster requires three nodes to be up to achieve quorum (in order to have a majority), while the three-node cluster requires only two. Both can therefore survive a single node failure and keep running, but the chance of a node failing in a four-node cluster is higher than in a three-node cluster.
Another note about etcd: due to the need to replicate data amongst members, the performance of etcd decreases as the cluster scales. A five-node cluster can commit about 5% fewer writes per second than a three-node cluster running on the same hardware.
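To make the quorum arithmetic concrete: quorum is a majority, i.e. `(n / 2) + 1` with integer division, which gives the following failure tolerance for common cluster sizes:

| Cluster size | Quorum | Tolerable failures |
| --- | --- | --- |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 2 | 1 |
| 4 | 3 | 1 |
| 5 | 3 | 2 |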
Recommendations for your control plane
- Run your clusters with three or five control plane nodes. Three is enough for most use cases. Five will give you better availability (it can tolerate two simultaneous node failures), but costs more, both in the number of nodes required and because each node may need more hardware resources to offset the performance degradation seen in larger clusters.
- Implement good monitoring and put processes in place to deal with a failed node in a timely manner (and test them!)
- Even with robust monitoring and procedures for replacing failed nodes in place, backup etcd and your control plane node configuration to guard against unforeseen disasters.
- Monitor the performance of your etcd clusters. If etcd performance is slow, vertically scale the nodes, not the number of nodes.
- If a control plane node fails, remove it first, then add the replacement node. (This ensures that the failed node does not “vote” when adding in the new node, minimizing the chances of a quorum violation.)
- If replacing a node that has not failed, add the new one, then remove the old.
Bootstrapping the Control Plane
Every new cluster must be bootstrapped only once, which is achieved by telling a single control plane node to initiate the bootstrap.
Bootstrapping itself does not do anything with Kubernetes.
Bootstrapping only tells `etcd` to form a cluster, so don’t judge the success of a bootstrap by the failure of Kubernetes to start. Kubernetes relies on `etcd`, so bootstrapping is required, but it is not sufficient for Kubernetes to start. If your Kubernetes cluster fails to form for other reasons (say, a bad configuration option or an unavailable container repository), and the bootstrap API call returned successfully, you do NOT need to bootstrap again: just fix the config or let Kubernetes retry.
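For reference, the bootstrap call is made against a single control plane node, along these lines (node address and config path are placeholders):
talosctl --talosconfig=./talosconfig -e 172.20.0.2 -n 172.20.0.2 bootstrap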
High-level Overview
Talos cluster bootstrap flow:
- The `etcd` service is started on control plane nodes. Instances of `etcd` on control plane nodes build the `etcd` cluster.
- The `kubelet` service is started.
- Control plane components are started as static pods via the `kubelet`, and the `kube-apiserver` component connects to the local (running on the same node) `etcd` instance.
- The `kubelet` issues a client certificate using the bootstrap token and the control plane endpoint (via `kube-apiserver` and `kube-controller-manager`).
- The `kubelet` registers the node in the API server.
- The Kubernetes control plane schedules pods on the nodes.
Cluster Bootstrapping
All nodes start the `kubelet` service. The `kubelet` tries to contact the control plane endpoint, but as it is not up yet, it keeps retrying.
One of the control plane nodes is chosen as the bootstrap node, and promoted using the bootstrap API (`talosctl bootstrap`). The bootstrap node initiates the `etcd` bootstrap process by initializing `etcd` as the first member of the cluster. Once `etcd` is bootstrapped, the bootstrap node has no special role and acts the same way as other control plane nodes.
The `etcd` services on non-bootstrap nodes try to get the `Endpoints` resource via the control plane endpoint, but that request fails as the control plane endpoint is not up yet.
As soon as `etcd` is up on the bootstrap node, static pod definitions for the Kubernetes control plane components (`kube-apiserver`, `kube-controller-manager`, `kube-scheduler`) are rendered to disk. The `kubelet` service on the bootstrap node picks up the static pod definitions and starts the Kubernetes control plane components. As soon as `kube-apiserver` is launched, the control plane endpoint comes up.
The bootstrap node acquires an `etcd` mutex and injects the bootstrap manifests into the API server. The set of bootstrap manifests specifies the Kubernetes join token and kubelet CSR auto-approval. The `kubelet` service on all the nodes is now able to issue client certificates for itself and register the node in the API server. Other bootstrap manifests specify additional resources critical for Kubernetes operations (e.g. CNI, PSP, etc.).
The `etcd` service on non-bootstrap nodes is now able to discover other members of the `etcd` cluster via the Kubernetes `Endpoints` resource. The `etcd` cluster is now formed and consists of all control plane nodes.
All control plane nodes render static pod manifests for the control plane components. Each node now runs a full set of components to make the control plane HA.
The `kubelet` service on worker nodes is now able to issue a client certificate and register itself with the API server.
Scaling Up the Control Plane
When new nodes are added to the control plane, the process is the same as the bootstrap process above: the `etcd` service discovers existing members of the control plane via the control plane endpoint, joins the `etcd` cluster, and the control plane components are scheduled on the node.
Scaling Down the Control Plane
Scaling down the control plane involves removing a node from the cluster. The most critical part is making sure that the node which is being removed leaves the etcd cluster. The recommended way to do this is to use:
talosctl -n IP.of.node.to.remove reset
kubectl delete node
When using the `talosctl reset` command, the targeted control plane node leaves the `etcd` cluster as part of the reset sequence, and its disks are erased.
Upgrading Talos on Control Plane Nodes
When a control plane node is upgraded, Talos leaves `etcd`, wipes the system disk, installs a new version of itself, and reboots. The upgraded node then joins the `etcd` cluster on reboot. So upgrading a control plane node is equivalent to scaling down the control plane node followed by scaling up with a new version of Talos.
5 - Image Factory
The Image Factory provides a way to download Talos Linux artifacts. Artifacts can be generated with customizations defined by a “schematic”. A schematic can be applied to any of the versions of Talos Linux offered by the Image Factory to produce a “model”.
The following assets are provided:
- ISO
- `kernel`, `initramfs`, and kernel command line
- UKI
- disk images in various formats (e.g. AWS, GCP, VMware, etc.)
- `installer` container images
The supported frontends are:
- HTTP
- PXE
- Container Registry
The official instance of Image Factory is available at https://factory.talos.dev.
See Boot Assets for an example of how to use the Image Factory to boot and upgrade Talos on different platforms. Full API documentation for the Image Factory is available at GitHub.
Schematics
Schematics are YAML files that define customizations to be applied to a Talos Linux image. Schematics can be applied to any of the versions of Talos Linux offered by the Image Factory to produce a “model”, which is a Talos Linux image with the customizations applied.
Schematics are content-addressable, that is, the content of the schematic is used to generate a unique ID. The schematic should be uploaded to the Image Factory first, and then the ID can be used to reference the schematic in a model.
Schematics can be generated using the Image Factory UI, or using the Image Factory API:
customization:
  extraKernelArgs: # optional
    - vga=791
  meta: # optional, allows to set initial Talos META
    - key: 0xa
      value: "{}"
  systemExtensions: # optional
    officialExtensions: # optional
      - siderolabs/gvisor
      - siderolabs/amd-ucode
overlay: # optional
  name: rpi_generic
  image: siderolabs/sbc-raspberry-pi
  options: # optional, any valid yaml, depends on the overlay implementation
    data: "mydata"
The “vanilla” schematic is:
customization:
and has an ID of `376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba`.
The schematic can be applied by uploading it to the Image Factory:
curl -X POST --data-binary @schematic.yaml https://factory.talos.dev/schematics
As the schematic is content-addressable, the same schematic can be uploaded multiple times, and the Image Factory will return the same ID.
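The response carries the schematic ID; for the “vanilla” schematic above it would be expected to look roughly like:
{"id":"376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba"}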
Models
Models are Talos Linux images with customizations applied. The inputs to generate a model are:
- schematic ID
- Talos Linux version
- model type (e.g. ISO, UKI, etc.)
- architecture (e.g. amd64, arm64)
- various model type specific options (e.g. disk image format, disk image size, etc.)
Frontends
Image Factory provides several frontends to retrieve models:
- HTTP frontend to download models (e.g. download an ISO or a disk image)
- PXE frontend to boot bare-metal machines (PXE script references kernel/initramfs from HTTP frontend)
- Registry frontend to fetch customized `installer` images (for initial Talos Linux installation and upgrades)
The links to different models are available in the Image Factory UI, and a full list of possible models is documented at GitHub.
In this guide we will provide a list of examples:
- amd64 ISO (for Talos v1.10.0-alpha.0, “vanilla” schematic) https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.10.0-alpha.0/metal-amd64.iso
- arm64 AWS image (for Talos v1.10.0-alpha.0, “vanilla” schematic) https://factory.talos.dev/image/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.10.0-alpha.0/aws-arm64.raw.xz
- amd64 PXE boot script (for Talos v1.10.0-alpha.0, “vanilla” schematic) https://pxe.factory.talos.dev/pxe/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba/v1.10.0-alpha.0/metal-amd64
- Talos `installer` image (for Talos v1.10.0-alpha.0, “vanilla” schematic, architecture is detected automatically): `factory.talos.dev/installer/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba:v1.10.0-alpha.0`
The `installer` image can be used to install Talos Linux on a bare-metal machine, or to upgrade an existing Talos Linux installation. As the Talos version and schematic ID can be changed via an upgrade process, the `installer` image can be used to upgrade to any version of Talos Linux, or to replace a set of installed system extensions.
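For example, a sketch of upgrading a node to this installer image (the node address is a placeholder):
talosctl -n 10.0.0.2 upgrade \
  --image factory.talos.dev/installer/376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba:v1.10.0-alpha.0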
UI
The Image Factory UI is available at https://factory.talos.dev. The UI provides a way to list supported Talos Linux versions and the system extensions available for each release, and to generate a schematic based on the selected system extensions.
The UI operations are equivalent to API operations.
Find Schematic ID from Talos Installation
The Image Factory always appends a “virtual” system extension whose version matches the schematic ID used to generate the model. So, for any running Talos Linux instance, the schematic ID can be found by looking at the list of system extensions:
$ talosctl get extensions
NAMESPACE TYPE ID VERSION NAME VERSION
runtime ExtensionStatus 0 1 schematic 376567988ad370138ad8b2698212367b8edcb69b5fd68c80be1f2ec7d603b4ba
Restrictions
Some models don’t include every customization of the schematic:
- `installer` and `initramfs` images only support system extensions (kernel args and META are ignored)
- `kernel` assets don’t depend on the schematic
Other models have full support for all customizations:
- any disk image format
- ISO, PXE boot script
When installing Talos Linux using ISO/PXE boot, Talos will be installed on the disk using the `installer` image, so the `installer` image in the machine configuration should use the same schematic as the ISO/PXE boot image.
Some system extensions are not available for all Talos Linux versions, so an attempt to generate a model with an unsupported system extension will fail. The list of supported Talos versions and of supported system extensions for each version is available in the Image Factory UI and API.
Under the Hood
Image Factory is based on the Talos `imager` container, which provides both the Talos base boot assets and the ability to generate custom assets based on a configuration. Image Factory manages a set of `imager` container images to acquire base Talos Linux boot assets (`kernel`, `initramfs`), a set of Talos Linux system extension images, and a set of schematics. When a model is requested, Image Factory uses the `imager` container to generate the requested assets based on the schematic and the Talos Linux version.
Security
Image Factory verifies the signatures of all source container images it fetches:
- `imager` container images (base boot assets)
- `extensions` system extension catalogs
- `installer` container images (base installer layer)
- Talos Linux system extension images
Internally, Image Factory caches generated boot assets and signs all cached images using a private key. Image Factory verifies the signature of the cached images before serving them to clients.
Image Factory signs the generated `installer` images, and verifies the signature of the `installer` images before serving them to clients.
Image Factory does not provide a way to list all schematics, as schematics may contain sensitive information (e.g. private kernel boot arguments). As the schematic ID is content-addressable, it is not possible to guess the ID of a schematic without knowing the content of the schematic.
Running your own Image Factory
Image Factory can be deployed on-premises to provide in-house asset generation.
Image Factory requires the following components:
- an OCI registry to store schematics (private)
- an OCI registry to store cached assets (private)
- an OCI registry to store `installer` images (should allow public read-only access)
- a container image signing key: an ECDSA P-256 private key in PEM format
Image Factory is configured using command line flags; use `--help` to see the list of available flags.
Image Factory should be configured to use proper authentication to push to the OCI registries:
- by mounting proper credentials via `~/.docker/config.json`
- by supplying `GITHUB_TOKEN` (for `ghcr.io`)
Image Factory performs HTTP redirects to the public registry endpoint for `installer` images, so the public endpoint should be available to Talos Linux machines to pull the `installer` images.
6 - Controllers and Resources
Talos implements concepts of resources and controllers to facilitate internal operations of the operating system. Talos resources and controllers are very similar to Kubernetes resources and controllers, but there are some differences. The content of this document is not required to operate Talos, but it is useful for troubleshooting.
Starting with Talos 0.9, most of the Kubernetes control plane bootstrapping and operations are implemented via controllers and resources, which allows Talos to react to configuration changes and environment changes (e.g. time sync).
Resources
A resource captures a piece of system state. Each resource belongs to a “Type” which defines the resource contents. Resource state can be split in two parts:
- metadata: a fixed set of fields describing the resource: namespace, type, ID, etc.
- spec: the contents of the resource (depends on the resource type).
A resource is uniquely identified by (`namespace`, `type`, `id`). Namespaces provide a way to avoid conflicts on duplicate resource IDs.
At the moment of this writing, all resources are local to the node and stored in memory. So on every reboot the resource state is rebuilt from scratch (the only exception is the `MachineConfig` resource, which reflects the current machine config).
Controllers
Controllers run as independent lightweight threads in Talos. The goal of the controller is to reconcile the state based on inputs and eventually update outputs.
A controller can have any number of resource types (and namespaces) as inputs.
In other words, it watches specified resources for changes and reconciles when these changes occur.
A controller might also have additional inputs: running reconcile on a schedule, watching `etcd` keys, etc.
A controller has a single output: a set of resources of a fixed type in a fixed namespace. Only one controller can manage a given resource type in a given namespace, so conflicts are avoided.
Querying Resources
The Talos CLI tool `talosctl` provides read-only access to the resource API, which includes getting a specific resource, listing resources, and watching for changes.
Talos stores resources describing resource types and namespaces in the `meta` namespace:
$ talosctl get resourcedefinitions
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 meta ResourceDefinition bootstrapstatuses.v1alpha1.talos.dev 1
172.20.0.2 meta ResourceDefinition etcdsecrets.secrets.talos.dev 1
172.20.0.2 meta ResourceDefinition kubernetescontrolplaneconfigs.config.talos.dev 1
172.20.0.2 meta ResourceDefinition kubernetessecrets.secrets.talos.dev 1
172.20.0.2 meta ResourceDefinition machineconfigs.config.talos.dev 1
172.20.0.2 meta ResourceDefinition machinetypes.config.talos.dev 1
172.20.0.2 meta ResourceDefinition manifests.kubernetes.talos.dev 1
172.20.0.2 meta ResourceDefinition manifeststatuses.kubernetes.talos.dev 1
172.20.0.2 meta ResourceDefinition namespaces.meta.cosi.dev 1
172.20.0.2 meta ResourceDefinition resourcedefinitions.meta.cosi.dev 1
172.20.0.2 meta ResourceDefinition rootsecrets.secrets.talos.dev 1
172.20.0.2 meta ResourceDefinition secretstatuses.kubernetes.talos.dev 1
172.20.0.2 meta ResourceDefinition services.v1alpha1.talos.dev 1
172.20.0.2 meta ResourceDefinition staticpods.kubernetes.talos.dev 1
172.20.0.2 meta ResourceDefinition staticpodstatuses.kubernetes.talos.dev 1
172.20.0.2 meta ResourceDefinition timestatuses.v1alpha1.talos.dev 1
$ talosctl get namespaces
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 meta Namespace config 1
172.20.0.2 meta Namespace controlplane 1
172.20.0.2 meta Namespace meta 1
172.20.0.2 meta Namespace runtime 1
172.20.0.2 meta Namespace secrets 1
Most of the time the namespace flag (`--namespace`) can be omitted, as the `ResourceDefinition` contains the default namespace which is used if no namespace is given:
$ talosctl get resourcedefinitions resourcedefinitions.meta.cosi.dev -o yaml
node: 172.20.0.2
metadata:
namespace: meta
type: ResourceDefinitions.meta.cosi.dev
id: resourcedefinitions.meta.cosi.dev
version: 1
phase: running
spec:
type: ResourceDefinitions.meta.cosi.dev
displayType: ResourceDefinition
aliases:
- resourcedefinitions
- resourcedefinition
- resourcedefinitions.meta
- resourcedefinitions.meta.cosi
- rd
- rds
printColumns: []
defaultNamespace: meta
The resource definition also contains type aliases which can be used interchangeably with the canonical resource name:
$ talosctl get ns config
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 meta Namespace config 1
Output
The `talosctl get` command supports the following output modes:
- `table` (default) prints the resource list as a table
- `yaml` prints pretty-formatted resources with details, including the full metadata and spec. This format carries the most details from the backend resource (e.g. comments in the `MachineConfig` resource)
- `json` prints the same information as `yaml`, but some additional details (e.g. comments) might be lost. This format is useful for automated processing with tools like `jq`.
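For example, the `json` output can be combined with `jq` to extract individual fields (the node address is a placeholder):
# print only the IDs of all service resources
talosctl -n 172.20.0.2 get services -o json | jq -r '.metadata.id'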
Watching Changes
If the flag `--watch` is appended to the `talosctl get` command, the command switches to watch mode. If a list of resources was requested, `talosctl` prints the initial contents of the list and then appends resource information for every change:
$ talosctl get svc -w
NODE * NAMESPACE TYPE ID VERSION RUNNING HEALTHY
172.20.0.2 + runtime Service timed 2 true true
172.20.0.2 + runtime Service trustd 2 true true
172.20.0.2 + runtime Service udevd 2 true true
172.20.0.2 - runtime Service timed 2 true true
172.20.0.2 + runtime Service timed 1 true false
172.20.0.2 runtime Service timed 2 true true
The column `*` specifies the event type:
- `+` is created
- `-` is deleted
- (blank) is updated
In YAML/JSON output, the field `event` is added to the resource representation to describe the event type.
Examples
Getting machine config:
$ talosctl get machineconfig -o yaml
node: 172.20.0.2
metadata:
namespace: config
type: MachineConfigs.config.talos.dev
id: v1alpha1
version: 2
phase: running
spec:
version: v1alpha1 # Indicates the schema used to decode the contents.
debug: false # Enable verbose logging to the console.
persist: true # Indicates whether to pull the machine config upon every boot.
# Provides machine specific configuration options.
...
Getting control plane static pod statuses:
$ talosctl get staticpodstatus
NODE NAMESPACE TYPE ID VERSION READY
172.20.0.2 controlplane StaticPodStatus kube-system/kube-apiserver-talos-default-controlplane-1 3 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-controller-manager-talos-default-controlplane-1 3 True
172.20.0.2 controlplane StaticPodStatus kube-system/kube-scheduler-talos-default-controlplane-1 4 True
Getting the static pod definition for `kube-apiserver`:
$ talosctl get sp kube-apiserver -n 172.20.0.2 -o yaml
node: 172.20.0.2
metadata:
namespace: controlplane
type: StaticPods.kubernetes.talos.dev
id: kube-apiserver
version: 3
phase: running
finalizers:
- k8s.StaticPodStatus("kube-apiserver")
spec:
apiVersion: v1
kind: Pod
metadata:
annotations:
talos.dev/config-version: "1"
talos.dev/secrets-version: "2"
...
Inspecting Controller Dependencies
Talos can report current dependencies between controllers and resources for debugging purposes:
$ talosctl inspect dependencies
digraph {
n1[label="config.K8sControlPlaneController",shape="box"];
n3[label="config.MachineTypeController",shape="box"];
n2[fillcolor="azure2",label="config:KubernetesControlPlaneConfigs.config.talos.dev",shape="note",style="filled"];
...
This outputs the graph in `graphviz` format, which can be rendered to PNG with the command:
talosctl inspect dependencies | dot -T png > deps.png
The graph can be enhanced by replacing resource types with actual resource instances:
talosctl inspect dependencies --with-resources | dot -T png > deps.png
7 - Networking Resources
The Talos network configuration subsystem is powered by COSI. Talos translates network configuration from multiple sources (machine configuration, cloud metadata, automatic network configuration such as DHCP) into COSI resources.
Network configuration and network state can be inspected using the `talosctl get` command.
Network machine configuration can be modified using the `talosctl edit mc` command (also the variants `talosctl patch mc` and `talosctl apply-config`) without a reboot. As API access requires a network connection, `--mode=try` can be used to test the configuration with automatic rollback to avoid losing network access to the node.
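A sketch of using this safety net when editing network configuration interactively (the node address is a placeholder):
# apply the edited configuration in "try" mode: the change is rolled back
# automatically after a timeout, so a mistake cannot permanently cut off access
talosctl -n 172.20.0.2 edit machineconfig --mode=try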
Resources
There are six basic network configuration items in Talos:
- `Address` (IP address assigned to the interface/link);
- `Route` (route to a destination);
- `Link` (network interface/link configuration);
- `Resolver` (list of DNS servers);
- `Hostname` (node hostname and domainname);
- `TimeServer` (list of NTP servers).
Each network configuration item has two counterparts:
- `*Status` (e.g. `LinkStatus`) describes the current state of the system (Linux kernel state);
- `*Spec` (e.g. `LinkSpec`) defines the desired configuration.
| Resource | Status | Spec |
| --- | --- | --- |
| Address | AddressStatus | AddressSpec |
| Route | RouteStatus | RouteSpec |
| Link | LinkStatus | LinkSpec |
| Resolver | ResolverStatus | ResolverSpec |
| Hostname | HostnameStatus | HostnameSpec |
| TimeServer | TimeServerStatus | TimeServerSpec |
Status resources have aliases with the `Status` suffix removed, so for example `AddressStatus` is also available as `Address`.
Talos networking controllers reconcile the state so that `*Status` equals the desired `*Spec`.
Observing State
The current network configuration state can be observed by querying `*Status` resources via `talosctl`:
$ talosctl get addresses
NODE NAMESPACE TYPE ID VERSION ADDRESS LINK
172.20.0.2 network AddressStatus eth0/172.20.0.2/24 1 172.20.0.2/24 eth0
172.20.0.2 network AddressStatus eth0/fe80::9804:17ff:fe9d:3058/64 2 fe80::9804:17ff:fe9d:3058/64 eth0
172.20.0.2 network AddressStatus flannel.1/10.244.4.0/32 1 10.244.4.0/32 flannel.1
172.20.0.2 network AddressStatus flannel.1/fe80::10b5:44ff:fe62:6fb8/64 2 fe80::10b5:44ff:fe62:6fb8/64 flannel.1
172.20.0.2 network AddressStatus lo/127.0.0.1/8 1 127.0.0.1/8 lo
172.20.0.2 network AddressStatus lo/::1/128 1 ::1/128 lo
In the output there are addresses set up by Talos (e.g. `eth0/172.20.0.2/24`) and addresses set up by other facilities (e.g. `flannel.1/10.244.4.0/32` set up by CNI).
Talos networking controllers watch the kernel state and update resources accordingly.
Additional details about the address can be accessed via the YAML output:
# talosctl get address eth0/172.20.0.2/24 -o yaml
node: 172.20.0.2
metadata:
namespace: network
type: AddressStatuses.net.talos.dev
id: eth0/172.20.0.2/24
version: 1
owner: network.AddressStatusController
phase: running
created: 2021-06-29T20:23:18Z
updated: 2021-06-29T20:23:18Z
spec:
address: 172.20.0.2/24
local: 172.20.0.2
broadcast: 172.20.0.255
linkIndex: 4
linkName: eth0
family: inet4
scope: global
flags: permanent
Resources can be watched for changes with the `--watch` flag to see how the configuration changes over time.
Other networking status resources can be inspected with `talosctl get routes`, `talosctl get links`, etc. For example:
$ talosctl get resolvers
NODE NAMESPACE TYPE ID VERSION RESOLVERS
172.20.0.2 network ResolverStatus resolvers 2 ["8.8.8.8","1.1.1.1"]
# talosctl get links -o yaml
node: 172.20.0.2
metadata:
namespace: network
type: LinkStatuses.net.talos.dev
id: eth0
version: 2
owner: network.LinkStatusController
phase: running
created: 2021-06-29T20:23:18Z
updated: 2021-06-29T20:23:18Z
spec:
index: 4
type: ether
linkIndex: 0
flags: UP,BROADCAST,RUNNING,MULTICAST,LOWER_UP
hardwareAddr: 4e:95:8e:8f:e4:47
broadcastAddr: ff:ff:ff:ff:ff:ff
mtu: 1500
queueDisc: pfifo_fast
operationalState: up
kind: ""
slaveKind: ""
driver: virtio_net
linkState: true
speedMbit: 4294967295
port: Other
duplex: Unknown
Inspecting Configuration
The desired networking configuration is combined from multiple sources and presented as `*Spec` resources:
$ talosctl get addressspecs
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 network AddressSpec eth0/172.20.0.2/24 2
172.20.0.2 network AddressSpec lo/127.0.0.1/8 2
172.20.0.2 network AddressSpec lo/::1/128 2
These `AddressSpecs` are applied to the Linux kernel to reach the desired state. If, for example, an `AddressSpec` is removed, the address is removed from the Linux network interface as well.
`*Spec` resources can’t be manipulated directly; they are generated automatically by Talos from multiple configuration sources (see the section below for details).
If a `*Spec` resource is queried in YAML format, some additional information is available:
# talosctl get addressspecs eth0/172.20.0.2/24 -o yaml
node: 172.20.0.2
metadata:
namespace: network
type: AddressSpecs.net.talos.dev
id: eth0/172.20.0.2/24
version: 2
owner: network.AddressMergeController
phase: running
created: 2021-06-29T20:23:18Z
updated: 2021-06-29T20:23:18Z
finalizers:
- network.AddressSpecController
spec:
address: 172.20.0.2/24
linkName: eth0
family: inet4
scope: global
flags: permanent
layer: operator
An important field is `layer`, which describes the configuration layer this spec is coming from: in this case, it’s generated by a network operator (see below), specifically the DHCPv4 operator.
Configuration Merging
Spec resources described in the previous section show the final merged configuration state, while the initial specs are put into a different, unmerged namespace, `network-config`. Spec resources in the `network-config` namespace are merged with conflict resolution to produce the final merged representation in the `network` namespace.
Let’s take `HostnameSpec` as an example.
The final merged representation is:
# talosctl get hostnamespec -o yaml
node: 172.20.0.2
metadata:
namespace: network
type: HostnameSpecs.net.talos.dev
id: hostname
version: 2
owner: network.HostnameMergeController
phase: running
created: 2021-06-29T20:23:18Z
updated: 2021-06-29T20:23:18Z
finalizers:
- network.HostnameSpecController
spec:
hostname: talos-default-controlplane-1
domainname: ""
layer: operator
We can see that the final configuration for the hostname is `talos-default-controlplane-1`, and this is the hostname that was actually applied. This can be verified by querying a `HostnameStatus` resource:
$ talosctl get hostnamestatus
NODE NAMESPACE TYPE ID VERSION HOSTNAME DOMAINNAME
172.20.0.2 network HostnameStatus hostname 1 talos-default-controlplane-1
The initial configuration for the hostname in the `network-config` namespace is:
# talosctl get hostnamespec -o yaml --namespace network-config
node: 172.20.0.2
metadata:
namespace: network-config
type: HostnameSpecs.net.talos.dev
id: default/hostname
version: 2
owner: network.HostnameConfigController
phase: running
created: 2021-06-29T20:23:18Z
updated: 2021-06-29T20:23:18Z
spec:
hostname: talos-172-20-0-2
domainname: ""
layer: default
---
node: 172.20.0.2
metadata:
namespace: network-config
type: HostnameSpecs.net.talos.dev
id: dhcp4/eth0/hostname
version: 1
owner: network.OperatorSpecController
phase: running
created: 2021-06-29T20:23:18Z
updated: 2021-06-29T20:23:18Z
spec:
hostname: talos-default-controlplane-1
domainname: ""
layer: operator
We can see that there are two specs for the hostname:
- one from the `default` configuration layer, which defines the hostname as `talos-172-20-0-2` (a default driven by the default node address);
- another one from the `operator` layer, which defines the hostname as `talos-default-controlplane-1` (DHCP).
Talos merges these two specs into a final `HostnameSpec` based on the configuration layer and merge rules.
Here is the order of precedence from low to high:
- `default` (defaults provided by Talos);
- `cmdline` (from the kernel command line);
- `platform` (driven by the cloud provider);
- `operator` (various dynamic configuration options: DHCP, Virtual IP, etc.);
- `configuration` (derived from the machine configuration).
So in our example the `operator` layer `HostnameSpec` overrides the `default` layer, producing the final hostname `talos-default-controlplane-1`.
The merge process applies to all six core networking specs. For each spec, the `layer` controls the merge behavior. If multiple configuration specs appear at the same layer, they are merged together if possible; otherwise the merge result is stable but not defined (e.g. if DHCP on multiple interfaces provides two different hostnames for the node). `LinkSpecs` are merged across layers, so for example, machine configuration for the interface MTU overrides an MTU set by the DHCP server.
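For instance, setting the hostname in the machine configuration produces a spec at the `configuration` layer, which would take precedence over the DHCP-provided `operator` layer value (the hostname value below is a placeholder):
machine:
  network:
    hostname: my-preferred-hostname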
Network Operators
Network operators provide dynamic network configuration which can change over time as the node is running:
- DHCPv4
- DHCPv6
- Virtual IP
Network operators produce specs for addresses, routes, links, etc., which are then merged and applied according to the rules described above.
Operators are configured with `OperatorSpec` resources, which describe when operators should run and additional configuration for the operator:
# talosctl get operatorspecs -o yaml
node: 172.20.0.2
metadata:
namespace: network
type: OperatorSpecs.net.talos.dev
id: dhcp4/eth0
version: 1
owner: network.OperatorConfigController
phase: running
created: 2021-06-29T20:23:18Z
updated: 2021-06-29T20:23:18Z
spec:
operator: dhcp4
linkName: eth0
requireUp: true
dhcp4:
routeMetric: 1024
`OperatorSpec` resources are generated by Talos mostly based on the machine configuration. The DHCP4 operator is created automatically for all physical network links which are not configured explicitly via the kernel command line or the machine configuration. This also means that on the first boot, without a machine configuration, a DHCP request is made on all physical network interfaces by default.
Specs generated by operators are prefixed with the operator ID (`dhcp4/eth0` in the example above) in the unmerged `network-config` namespace:
$ talosctl -n 172.20.0.2 get addressspecs --namespace network-config
NODE NAMESPACE TYPE ID VERSION
172.20.0.2 network-config AddressSpec dhcp4/eth0/eth0/172.20.0.2/24 1
Other Network Resources
There are some additional resources describing the network subsystem state.
The `NodeAddress` resource presents node addresses excluding link-local and loopback addresses:
$ talosctl get nodeaddresses
NODE NAMESPACE TYPE ID VERSION ADDRESSES
10.100.2.23 network NodeAddress accumulative 6 ["10.100.2.23","147.75.98.173","147.75.195.143","192.168.95.64","2604:1380:1:ca00::17"]
10.100.2.23 network NodeAddress current 5 ["10.100.2.23","147.75.98.173","192.168.95.64","2604:1380:1:ca00::17"]
10.100.2.23 network NodeAddress default 1 ["10.100.2.23"]
- `default` is the node default address;
- `current` is the set of addresses a node currently has;
- `accumulative` is the set of addresses a node had over time (it might include virtual IPs which are not owned by the node at the moment).
`NodeAddress` resources are used to pick the default address for the `etcd` peer URL, to populate the SANs field in the generated certificates, etc.
Another important resource is `Nodename`, which provides the `Node` name in Kubernetes:
$ talosctl get nodename
NODE NAMESPACE TYPE ID VERSION NODENAME
10.100.2.23 controlplane Nodename nodename 1 infra-green-cp-mmf7v
Depending on the machine configuration, the `nodename` might be just a hostname or the FQDN of the node.
`NetworkStatus` aggregates the current state of the network configuration:
# talosctl get networkstatus -o yaml
node: 10.100.2.23
metadata:
namespace: network
type: NetworkStatuses.net.talos.dev
id: status
version: 5
owner: network.StatusController
phase: running
created: 2021-06-24T18:56:00Z
updated: 2021-06-24T18:56:02Z
spec:
addressReady: true
connectivityReady: true
hostnameReady: true
etcFilesReady: true
Network Controllers
For each of the six basic resource types, there are several controllers:
- `*StatusController` populates `*Status` resources observing the Linux kernel state.
- `*ConfigController` produces the initial unmerged `*Spec` resources in the `network-config` namespace based on defaults, the kernel command line, and the machine configuration.
- `*MergeController` merges `*Spec` resources into the final representation in the `network` namespace.
- `*SpecController` applies merged `*Spec` resources to the kernel state.
For the network operators:
- `OperatorConfigController` produces `OperatorSpec` resources based on the machine configuration and defaults.
- `OperatorSpecController` runs network operators watching `OperatorSpec` resources and producing various `*Spec` resources in the `network-config` namespace.
Configuration Sources
There are several configuration sources for the network configuration, which are described in this section.
Defaults
- the `lo` interface is assigned the addresses `127.0.0.1/8` and `::1/128`;
- the hostname is set to `talos-<IP>` where `IP` is the default node address;
- resolvers are set to `8.8.8.8`, `1.1.1.1`;
- time servers are set to `pool.ntp.org`;
- the DHCP4 operator is run on any physical interface which is not configured explicitly.
Cmdline
The kernel command line is parsed for the following options:
- the `ip=` option is parsed for the node IP, default gateway, hostname, DNS servers, and NTP servers;
- the `bond=` option is parsed for bonding interfaces and their options;
- the `talos.hostname=` option is used to set the node hostname;
- the `talos.network.interface.ignore=` option can be used to make Talos skip network interface configuration completely.
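As a rough illustration (all addresses and names are placeholders), a kernel command line combining several of these options might look like:
ip=172.20.0.2::172.20.0.1:255.255.255.0::eth0:off talos.hostname=node1 talos.network.interface.ignore=eth1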
Platform
Platform configuration delivers cloud environment-specific options (e.g. the hostname).
Platform configuration is specific to the environment metadata: for example, on Equinix Metal, Talos automatically configures public and private IPs, routing, link bonding, hostname.
Platform configuration is cached across reboots in `/system/state/platform-network.yaml`.
Operator
Network operators provide configuration for all basic resource types.
Machine Configuration
The machine configuration is parsed for link configuration, addresses, routes, hostname, resolvers, and time servers. Any changes to the `.machine.network` configuration can be applied in immediate mode.
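A minimal sketch of such configuration in the machine config (the interface name and addresses are placeholders):
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 172.20.0.2/24
        routes:
          - network: 0.0.0.0/0
            gateway: 172.20.0.1
    nameservers:
      - 1.1.1.1
      - 8.8.8.8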
Network Configuration Debugging
Most of the network controller operations and failures are logged to the kernel console; additional logs with `debug` level are available with the `talosctl logs controller-runtime` command. If the network configuration can’t be established and the API is not available, `debug` level logs can be sent to the console with the `debug: true` option in the machine configuration.
8 - Network Connectivity
Configuring Network Connectivity
The simplest way to deploy Talos is by ensuring that all the remote components of the system (`talosctl`, the control plane nodes, and worker nodes) have layer 2 connectivity. This is not always possible, however, so this page lays out the minimal network access that is required to configure and operate a Talos cluster.
Note: These are the ports required by Talos specifically, and should be configured in addition to the ports required by Kubernetes. See the Kubernetes docs for information on the ports used by Kubernetes itself.
Control plane node(s)
| Protocol | Direction | Port Range | Purpose | Used By |
| --- | --- | --- | --- | --- |
| TCP | Inbound | 50000* | apid | talosctl, control plane nodes |
| TCP | Inbound | 50001* | trustd | Worker nodes |
Ports marked with a `*` are not currently configurable, but that may change in the future. Follow along here.
Worker node(s)
| Protocol | Direction | Port Range | Purpose | Used By |
| --- | --- | --- | --- | --- |
| TCP | Inbound | 50000* | apid | Control plane nodes |
Ports marked with a `*` are not currently configurable, but that may change in the future. Follow along here.
9 - KubeSpan
WireGuard Peer Discovery
The key pieces of information needed for WireGuard generally are:
- the public key of the host you wish to connect to
- an IP address and port of the host you wish to connect to
The latter is really only required of one side of the pair. Once traffic is received, that information is learned and updated by WireGuard automatically.
Kubernetes, though, also needs to know which traffic goes to which WireGuard peer. Because this information may be dynamic, we need a way to keep this information up to date.
If we already have a connection to Kubernetes, it’s fairly easy: we can just keep that information in Kubernetes. Otherwise, we have to have some way to discover it.
Talos Linux implements a multi-tiered approach to gathering this information. Each tier can operate independently, but the amalgamation of the mechanisms produces a more robust set of connection criteria.
These mechanisms are:
- an external service
- a Kubernetes-based system
See discovery service to learn more about the external service.
The Kubernetes-based system utilizes annotations on Kubernetes Nodes which describe each node’s public key and local addresses.
On top of this, KubeSpan can optionally route Pod subnets. This is usually taken care of by the CNI, but there are many situations where the CNI is not able to do this itself across networks.
NAT, Multiple Routes, Multiple IPs
One of the difficulties in communicating across networks is that there is often not a single address and port which can identify a connection for each node on the system.
For instance, a node sitting on the same network might see its peer as `192.168.2.10`, but a node across the internet may see it as `2001:db8:1ef1::10`.
We need to be able to handle any number of addresses and ports, and we also need to have a mechanism to try them. WireGuard only allows us to select one at a time.
KubeSpan implements a controller which continuously discovers and rotates these IP:port pairs until a connection is established. It then starts trying again if that connection ever fails.
Packet Routing
After we have established a WireGuard connection, we have to make sure that the right packets get sent to the WireGuard interface.
WireGuard supplies a convenient facility for tagging packets which come from it, which is great. But in our case, we need to be able to allow traffic which both does not come from WireGuard and also is not destined for another Kubernetes node to flow through the normal mechanisms.
Unlike many corporate or privacy-oriented VPNs, we need to allow general internet traffic to flow normally.
Also, as our cluster grows, this set of IP addresses can become quite large and quite dynamic.
This would be very cumbersome and slow in `iptables`.
Luckily, the kernel supplies a convenient mechanism by which to define this arbitrarily large set of IP addresses: IP sets.
Talos collects all of the IPs and subnets which are considered “in-cluster” and maintains these in the kernel as an IP set.
Now that we have the IP set defined, we need to tell the kernel how to use it.
The traditional way of doing this would be to use `iptables`.
However, there is a big problem with IPTables.
It is a common namespace in which any number of other pieces of software may dump things.
We have no surety that what we add will not be wiped out by something else (from Kubernetes itself, to the CNI, to some workload application), be rendered unusable by higher-priority rules, or just generally cause trouble and conflicts.
Instead, we use a three-pronged system which is both more foundational and less centralised.
NFTables offers a separately namespaced, decentralised way of marking packets for later processing based on IP sets. Instead of a common set of well-known tables, NFTables uses hooks into the kernel’s netfilter system, which are less vulnerable to being usurped, bypassed, or a source of interference than IPTables, but which are rendered down by the kernel to the same underlying XTables system.
Our NFTables system is where we store the IP sets. Any packet which enters the system, either forwarded from inside Kubernetes or generated by the host itself, is compared against a hash table of this IP set. If it is matched, it is marked for later processing by our next stage. This is a high-performance mechanism which runs entirely in the kernel, so it scales well to hundreds of nodes.
The next stage is the kernel router’s route rules. These are defined as a common ordered list of operations for the whole operating system, but they are intended to be tightly constrained and are rarely used by applications in any case. The rules we add are very simple: if a packet is marked by our NFTables system, send it to an alternate routing table.
This leads us to our third and final stage of packet routing. We have a custom routing table with two rules:
- send all IPv4 traffic to the WireGuard interface
- send all IPv6 traffic to the WireGuard interface
So in summary, we:
- mark packets destined for Kubernetes applications or Kubernetes nodes
- send marked packets to a special routing table
- send anything which is sent to that routing table through the WireGuard interface
This gives us an isolated, resilient, tolerant, and non-invasive way to route Kubernetes traffic safely, automatically, and transparently through WireGuard across almost any set of network topologies.
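As a rough illustration only (Talos has no shell and programs these objects through its own controllers, not with these tools), the equivalent iproute2 and nft commands on a generic Linux host would look something like the sketch below, assuming an inet-family table and a WireGuard interface named kubespan:
# Stage 1: an nftables set of "in-cluster" destinations, plus a rule that marks
# matching packets with the "to WireGuard" bit (0x40) of the firewall mark.
# (Only IPv4 and the prerouting hook are shown; IPv6 and the output hook are analogous.)
nft add table inet talos_kubespan
nft add set inet talos_kubespan kubespan_targets_ipv4 '{ type ipv4_addr; flags interval; }'
nft add chain inet talos_kubespan kubespan_prerouting '{ type filter hook prerouting priority 0; }'
nft add rule inet talos_kubespan kubespan_prerouting 'ip daddr @kubespan_targets_ipv4 meta mark set meta mark & 0xffffffdf | 0x00000040 accept'
# Stage 2: a route rule that sends packets carrying that mark to routing table 180.
ip rule add fwmark 0x40/0x60 table 180 priority 32500
# Stage 3: table 180 routes everything through the WireGuard interface.
ip route add default dev kubespan table 180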
Design Decisions
Routing
Routing for Wireguard is a touch complicated when the set of possible peer endpoints includes at least one member of the set of destinations. That is, packets from Wireguard to a peer endpoint should not be sent to Wireguard, lest a loop be created.
In order to handle this situation, Wireguard provides the ability to mark packets which it generates, so their routing can be handled separately.
In our case, though, we actually want the inverse of this: we want to route Wireguard packets however the normal networking routes and rules say they should be routed, while packets destined for the other side of Wireguard Peers should be forced into Wireguard interfaces.
While IP Rules allow you to invert matches, they do not support matching based on IP sets. That means, to use simple rules, we would have to add a rule for each destination, which could reach into hundreds or thousands of rules to manage. This is not really much of a performance issue, but it is a management issue, since it is expected that we would not be the only manager of rules in the system, and rules offer no facility to tag for ownership.
IP Sets are supported by IPTables, and we could integrate there. However, IPTables exists in a global namespace, which makes it fragile when multiple parties manipulate it. The newer NFTables replacement for IPTables, though, allows users to independently hook into various points of XTables, keeping all such rules and sets independent. This means that regardless of what CNIs or other user-side routing rules may do, our KubeSpan setup will not be disturbed.
Therefore, we utilise NFTables (which natively supports IP sets and owner grouping) instead, to mark matching traffic which should be sent to the Wireguard interface.
This way, we can keep all our KubeSpan set logic in one place, allowing us to use a single ip rule match for our fwmark, and to send those matched packets to a separate routing table with one rule: default to the Wireguard interface.
So we have three components:
- A routing table for Wireguard-destined packets
- An NFTables table which defines the set of destinations; packets to those destinations will be marked with our firewall mark.
  - Hook into PreRouting (type Filter)
  - Hook into Outgoing (type Route)
- One IP Rule which sends packets marked with our firewall mark to our Wireguard routing table.
Routing Table
The routing table (number 180 by default) is simple, containing a single route for each family: send everything through the Wireguard interface.
NFTables
The logic inside NFTables is fairly simple.
First, everything is compiled into a single table: talos_kubespan.
Next, two chains are set up: one for the prerouting hook (kubespan_prerouting) and the other for the outgoing hook (kubespan_outgoing).
We define two sets of target IP prefixes: one for IPv6 (kubespan_targets_ipv6) and the other for IPv4 (kubespan_targets_ipv4).
Last, we add rules to each chain which basically specify:
- If the packet is marked as coming from Wireguard, accept it and terminate the chain.
- If the packet matches an IP in either of the target IP sets, mark it with the “to Wireguard” mark.
Rules
There are two route rules defined: one to match IPv6 packets and the other to match IPv4 packets.
These rules say the same thing for each: if the packet is marked that it should go to Wireguard, send it to the Wireguard routing table.
Firewall Mark
KubeSpan uses only two bits of the firewall mark, with the mask 0x00000060.
Note: if other software on the node uses the bits covered by the 0x60 mask of the firewall mark, this might cause conflicts and break KubeSpan. At the time of writing, it was confirmed that Calico CNI uses the bits 0xffff0000 and Cilium CNI uses the bits 0xf00, so KubeSpan is compatible with both. Flannel CNI uses the 0x4000 mask, so it is also compatible.
In the routing rules table, we match on the mark 0x40 with the mask 0x60:
32500: from all fwmark 0x40/0x60 lookup 180
In the NFTables table, we match with the same mask 0x60, and we set the mark by modifying only the bits covered by the 0x60 mask:
meta mark & 0x00000060 == 0x00000020 accept
ip daddr @kubespan_targets_ipv4 meta mark set meta mark & 0xffffffdf | 0x00000040 accept
ip6 daddr @kubespan_targets_ipv6 meta mark set meta mark & 0xffffffdf | 0x00000040 accept
10 - Process Capabilities
Linux defines a set of process capabilities that can be used to fine-tune the process permissions.
For security reasons, Talos Linux restricts any process from gaining the following capabilities:
- CAP_SYS_MODULE (loading kernel modules)
- CAP_SYS_BOOT (rebooting the system)
This means that any process including privileged Kubernetes pods will not be able to get these capabilities.
If you see the following error on starting a pod, make sure it doesn’t have any of the capabilities listed above in the spec:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: unable to apply caps: operation not permitted: unknown
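For example, a hypothetical pod which explicitly requests one of these capabilities will fail to start with the error above:
# Hypothetical example: a pod requesting SYS_MODULE will be rejected on Talos.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sys-module-test
spec:
  containers:
    - name: test
      image: alpine
      command: ["sleep", "infinity"]
      securityContext:
        capabilities:
          add: ["SYS_MODULE"]
EOF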
Note: even with the CAP_SYS_MODULE capability, Linux kernel module loading is restricted by requiring a valid signature. Talos Linux creates a throwaway signing key during the kernel build, so it is not possible to build and sign a kernel module for Talos Linux outside of the build process.
11 - talosctl
The talosctl tool acts as a reference implementation for the Talos API, but it also provides many conveniences for working with Talos and its clusters.
Video Walkthrough
To see some live examples of talosctl usage, view the following video:
Client Configuration
Talosctl configuration is located in $XDG_CONFIG_HOME/talos/config.yaml if $XDG_CONFIG_HOME is defined.
Otherwise it is in $HOME/.talos/config.
The location can always be overridden by the TALOSCONFIG environment variable or the --talosconfig parameter.
Like kubectl, talosctl uses the concept of configuration contexts, so any number of Talos clusters can be managed with a single configuration file.
It also comes with some intelligent tooling to manage the merging of new contexts into the config.
The default operation is a non-destructive merge, where if a context of the same name already exists in the file, the context to be added is renamed by appending an index number.
You can easily overwrite instead, as well.
See the talosctl config help for more information.
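For example, merging a newly generated client configuration and switching contexts might look like this (paths and context names are illustrative):
# Merge a new cluster's talosconfig into the default configuration;
# a conflicting context name is index-appended rather than overwritten.
talosctl config merge ./talosconfig
# List the available contexts and switch to one of them.
talosctl config contexts
talosctl config context my-cluster
# Or point a single command at a specific configuration file.
TALOSCONFIG=./talosconfig talosctl version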
Endpoints and Nodes
endpoints are the communication endpoints to which the client directly talks.
These can be load balancers, DNS hostnames, a list of IPs, etc.
If multiple endpoints are specified, the client will automatically load balance and fail over between them.
It is recommended that these point to the set of control plane nodes, either directly or through a load balancer.
Each endpoint will automatically proxy requests destined to another node through it, so it is not necessary to change the endpoint configuration just because you wish to talk to a different node within the cluster.
Endpoints do, however, need to be members of the same Talos cluster as the target node, because these proxied connections rely on certificate-based authentication.
The node is the target node on which you wish to perform the API call.
While you can configure the target node (or even a set of target nodes) inside the talosctl configuration file, it is recommended not to do so, but to explicitly declare the target node(s) using the -n or --nodes command-line parameter.
When specifying nodes, their IPs and/or hostnames are as seen by the endpoint servers, not by the client. This is because all connections are proxied through the endpoints.
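For example (the IP addresses here are placeholders):
# Point the client at the control plane nodes; these are the endpoints.
talosctl config endpoint 10.0.0.1 10.0.0.2 10.0.0.3
# Target specific nodes for individual calls; requests are proxied through an endpoint.
talosctl -n 10.0.0.7 version
talosctl --nodes 10.0.0.7,10.0.0.8 services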
Kubeconfig
The configuration for accessing a Talos Kubernetes cluster is obtained with talosctl.
By default, talosctl will safely merge the cluster into the default kubeconfig.
Like talosctl itself, in the event of a naming conflict, the new context name will be index-appended before insertion.
The --force option can be used to overwrite instead.
You can also specify an alternate path by supplying it as a positional parameter.
Thus, like Talos clusters themselves, talosctl makes it easy to manage any number of Kubernetes clusters from the same workstation.
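For example:
# Merge the cluster's credentials into the default kubeconfig, renaming the context on conflict.
talosctl -n <NODE> kubeconfig
# Write to an alternate path instead, overwriting any existing context.
talosctl -n <NODE> kubeconfig ./kubeconfig --force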
Commands
Please see the CLI reference for the entire list of commands which are available from talosctl.
12 - FAQs
How is Talos different from other container optimized Linux distros?
Talos integrates tightly with Kubernetes, and is not meant to be a general-purpose operating system. The most important difference is that Talos is fully controlled by an API via a gRPC interface, instead of an ordinary shell. We don’t ship SSH, and there is no console access. Removing components such as these has allowed us to dramatically reduce the footprint of Talos, and in turn, improve a number of other areas like security, predictability, reliability, and consistency across platforms. It’s a big change from how operating systems have been managed in the past, but we believe that API-driven OSes are the future.
Why no shell or SSH?
Since Talos is fully API-driven, all maintenance and debugging operations are possible via the OS API.
We would like for Talos users to start thinking about what a “machine” is in the context of a Kubernetes cluster.
That is, that a Kubernetes cluster can be thought of as one massive machine, and the nodes are merely additional, undifferentiated resources.
We don’t want humans to focus on the nodes, but rather on the machine that is the Kubernetes cluster.
Should an issue arise at the node level, talosctl should provide the necessary tooling to assist in the identification, debugging, and remediation of the issue.
However, the API is based on the Principle of Least Privilege, and exposes only a limited set of methods.
We envision Talos being a great place for the application of control theory in order to provide a self-healing platform.
Why the name “Talos”?
Talos was an automaton created by the Greek God of the forge to protect the island of Crete. He would patrol the coast and enforce laws throughout the land. We felt it was a fitting name for a security focused operating system designed to run Kubernetes.
Why does Talos rely on a separate configuration from Kubernetes?
The talosconfig file contains client credentials to access the Talos Linux API.
Sometimes Kubernetes might be down for a number of reasons (etcd issues, misconfiguration, etc.), while Talos API access will always be available.
The Talos API is a way to access the operating system and fix issues, e.g. fixing access to Kubernetes.
When Talos Linux is running fine, using the Kubernetes APIs (via kubeconfig) is all you should need to deploy and manage Kubernetes workloads.
How does Talos handle certificates?
During the machine config generation process, Talos generates a set of certificate authorities (CAs) that remains valid for 10 years.
Talos is responsible for managing certificates for etcd, the Talos API (apid), node certificates (kubelet), and other components.
It also handles the automatic rotation of server-side certificates.
However, client certificates such as talosconfig and kubeconfig are the user’s responsibility, and by default, they have a validity period of 1 year.
To renew the talosconfig certificate, follow this process.
To renew kubeconfig, use the talosctl kubeconfig command; the time-to-live (TTL) is defined in the configuration.
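For example, on recent Talos versions the client credentials can be reissued roughly like this (command availability and flags vary by version):
# Issue a fresh talosconfig with a new client certificate.
talosctl -n <NODE> config new ./talosconfig-new
# Regenerate kubeconfig; its certificate TTL is taken from the cluster configuration.
talosctl -n <NODE> kubeconfig --force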
How can I set the timezone of my Talos Linux clusters?
Talos doesn’t support timezones, and will always run in UTC. This ensures consistency of log timestamps for all Talos Linux clusters, simplifying debugging. Your containers can run with any timezone configuration you desire, but the timezone of Talos Linux is not configurable.
How do I see Talos kernel configuration?
Using Talos API
The current kernel config can be read with talosctl -n <NODE> read /proc/config.gz.
For example:
talosctl -n NODE read /proc/config.gz | zgrep E1000
Using GitHub
For amd64, see https://github.com/siderolabs/pkgs/blob/main/kernel/build/config-amd64.
Use the appropriate branch to see the kernel config matching your Talos release.
13 - Knowledge Base
Disabling GracefulNodeShutdown on a node
Talos Linux enables the Kubernetes Graceful Node Shutdown feature by default.
To disable this feature, modify the kubelet part of the machine configuration with:
machine:
kubelet:
extraArgs:
feature-gates: GracefulNodeShutdown=false
extraConfig:
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
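One way to apply this to a running node is as a machine configuration patch (flag syntax may differ slightly between talosctl versions):
# Save the snippet above as gnsd-patch.yaml, then patch the node's machine configuration.
talosctl -n <NODE> patch machineconfig --patch @gnsd-patch.yaml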
Generating Talos Linux ISO image with custom kernel arguments
Pass additional kernel arguments using the --extra-kernel-arg flag:
$ docker run --rm -i ghcr.io/siderolabs/imager:v1.10.0-alpha.0 iso --arch amd64 --tar-to-stdout --extra-kernel-arg console=ttyS1 --extra-kernel-arg console=tty0 | tar xz
2022/05/25 13:18:47 copying /usr/install/amd64/vmlinuz to /mnt/boot/vmlinuz
2022/05/25 13:18:47 copying /usr/install/amd64/initramfs.xz to /mnt/boot/initramfs.xz
2022/05/25 13:18:47 creating grub.cfg
2022/05/25 13:18:47 creating ISO
The ISO will be output to the file talos-<arch>.iso in the current directory.
Logging Kubernetes audit logs with loki
If you use the loki-stack Helm chart to gather logs from the Kubernetes cluster, you can use the following Helm values to configure loki-stack to collect Kubernetes API server audit logs:
promtail:
extraArgs:
- -config.expand-env
# this is required so that the promtail process can read the kube-apiserver audit logs written as `nobody` user
containerSecurityContext:
capabilities:
add:
- DAC_READ_SEARCH
extraVolumes:
- name: audit-logs
hostPath:
path: /var/log/audit/kube
extraVolumeMounts:
- name: audit-logs
mountPath: /var/log/audit/kube
readOnly: true
config:
snippets:
extraScrapeConfigs: |
- job_name: auditlogs
static_configs:
- targets:
- localhost
labels:
job: auditlogs
host: ${HOSTNAME}
__path__: /var/log/audit/kube/*.log
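These values can then be passed to the chart, for example (assuming the grafana Helm repository has already been added; release and namespace names are illustrative):
# Install or upgrade loki-stack with the promtail audit-log values shown above.
helm upgrade --install loki grafana/loki-stack \
  --namespace monitoring --create-namespace \
  -f values.yaml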
Setting CPU scaling governor
While it is possible to set the CPU scaling governor via .machine.sysfs, it is cumbersome to set it for every CPU individually.
A more elegant approach is to set it via a kernel command line parameter, which also means the option is applied early in the boot process.
This can be set in the machine config via the snippet below:
machine:
install:
extraKernelArgs:
- cpufreq.default_governor=performance
Note: Talos needs to be upgraded for the extraKernelArgs to take effect.
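After the upgrade, the active governor can be verified through the Talos API (the cpufreq sysfs path is only present when the platform exposes frequency scaling):
# Check which scaling governor is active on the first CPU.
talosctl -n <NODE> read /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor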
Disable admissionControl on control plane nodes
Talos Linux enables admission control in the API Server by default.
Although it is not recommended from a security point of view, admission control can be removed by patching your control plane machine configuration:
talosctl gen config \
my-cluster https://mycluster.local:6443 \
--config-patch-control-plane '[{"op": "remove", "path": "/cluster/apiServer/admissionControl"}]'