Welcome

Welcome to the Talos documentation. If you are just getting familiar with Talos, we recommend starting here:

  • What is Talos: a quick description of Talos
  • Quickstart: the fastest way to get a Talos cluster up and running
  • Getting Started: a long-form, guided tour of getting a full Talos cluster deployed

Open Source

Community

If you’re interested in this project and would like to help in engineering efforts, or have general usage questions, we are happy to have you! We hold a weekly meeting that all audiences are welcome to attend.

We would appreciate your feedback so that we can make Talos even better! To do so, you can take our survey.

Office Hours

You can subscribe to this meeting by joining the community forum above.

Enterprise

If you are using Talos in a production setting, and need consulting services to get started or to integrate Talos into your existing environment, we can help. Sidero Labs, Inc. offers support contracts with SLA (Service Level Agreement)-bound terms for mission-critical environments.

Learn More

1 - Introduction

1.1 - What is Talos?

Talos is a container-optimized Linux distro; a reimagining of Linux for distributed systems such as Kubernetes. It is designed to be as minimal as possible while still maintaining practicality. For these reasons, Talos has a number of features unique to it:

  • it is immutable
  • it is atomic
  • it is ephemeral
  • it is minimal
  • it is secure by default
  • it is managed via a single declarative configuration file and gRPC API

Talos can be deployed on container, cloud, virtualized, and bare metal platforms.

Why Talos

In having less, Talos offers more. Security. Efficiency. Resiliency. Consistency.

All of these areas are improved simply by having less.

1.2 - Quickstart

Local Docker Cluster

The easiest way to try Talos is by using the CLI (talosctl) to create a cluster on a machine with docker installed.

Prerequisites

talosctl

Download talosctl:

amd64
curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl
arm64

For linux and darwin operating systems talosctl is also available for the arm64 processor architecture.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-arm64
chmod +x /usr/local/bin/talosctl

kubectl

Download kubectl via one of the methods outlined in the Kubernetes documentation.
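
For example, on a Linux amd64 workstation, one of the methods from the Kubernetes documentation is:

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/kubectl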

Create the Cluster

Now run the following:

talosctl cluster create

Verify that you can reach Kubernetes:

$ kubectl get nodes -o wide
NAME                     STATUS   ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-default-master-1   Ready    master   115s   v1.20.2   10.5.0.2      <none>        Talos (v0.14.0)   <host kernel>    containerd://1.5.5
talos-default-worker-1   Ready    <none>   115s   v1.20.2   10.5.0.3      <none>        Talos (v0.14.0)   <host kernel>    containerd://1.5.5

Destroy the Cluster

When you are all done, remove the cluster:

talosctl cluster destroy

1.3 - Getting Started

This document will walk you through installing a full Talos Cluster. You may wish to read through the Quickstart first, to quickly create a local virtual cluster on your workstation.

Regardless of where you run Talos, you will find that there is a pattern to deploying it.

In general you will need to:

  • acquire the installation image
  • decide on the endpoint for Kubernetes
    • optionally create a load balancer
  • configure Talos
  • configure talosctl
  • bootstrap Kubernetes

Prerequisites

talosctl

The talosctl tool is a CLI which interfaces with the Talos API in an easy manner. It also includes a number of useful tools for creating and managing your clusters.

You should install talosctl before continuing:

amd64

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

arm64

For linux and darwin operating systems talosctl is also available for the arm64 processor architecture.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-arm64
chmod +x /usr/local/bin/talosctl

Acquire the installation image

The easiest way to install Talos is to use the ISO image.

The latest ISO image can be found on the Github Releases page:
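
For example, to fetch the latest amd64 ISO (the same asset name is used later in this guide; adjust the architecture as needed):

curl -LO https://github.com/talos-systems/talos/releases/latest/download/talos-amd64.iso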

For self-built media and network booting, you can use the kernel and initramfs:
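
For example, assuming the amd64 kernel and initramfs assets are published as vmlinuz-amd64 and initramfs-amd64.xz on the release page:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/vmlinuz-amd64
curl -LO https://github.com/talos-systems/talos/releases/latest/download/initramfs-amd64.xz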

When booted from the ISO, Talos will run in RAM, and it will not install itself until it is provided a configuration. Thus, it is safe to boot the ISO onto any machine.

Alternative Booting

If you wish to use a different boot mechanism (such as network boot or a custom ISO), there are a number of required kernel parameters.

Please see the kernel docs for more information.

Decide the Kubernetes Endpoint

In order to configure Kubernetes and bootstrap the cluster, Talos needs to know what the endpoint (DNS name or IP address) of the Kubernetes API Server will be.

The endpoint should be the fully-qualified HTTP(S) URL for the Kubernetes API Server, which (by default) runs on port 6443 using HTTPS.

Thus, the format of the endpoint may be something like:

  • https://192.168.0.10:6443
  • https://kube.mycluster.mydomain.com:6443
  • https://[2001:db8:1234::80]:6443

Because the Kubernetes controlplane is meant to be supplied in a high availability manner, we must also choose how to bind it to the servers themselves. There are three common ways to do this.

Dedicated Load-balancer

If you are using a cloud provider or have your own load-balancer available (such as HAProxy, nginx reverse proxy, or an F5 load-balancer), using a dedicated load balancer is a natural choice. Just create an appropriate frontend matching the endpoint, and point the backends at each of the addresses of the Talos controlplane nodes.

This is convenient if a load-balancer is available, but don’t worry if that is not the case.

Layer 2 Shared IP

Talos has integrated support for serving Kubernetes from a shared (sometimes called “virtual”) IP address. This method relies on OSI Layer 2 connectivity between controlplane Talos nodes.

In this case, we may choose an IP address on the same subnet as the Talos controlplane nodes which is not otherwise assigned to any machine. For instance, if your controlplane node IPs are:

  • 192.168.0.10
  • 192.168.0.11
  • 192.168.0.12

You could choose the IP 192.168.0.15 as your shared IP address. Just make sure that 192.168.0.15 is not used by any other machine and that your DHCP server will not serve it to any other machine.

Once chosen, form the full HTTPS URL from this IP:

https://192.168.0.15:6443

You are also free to set a DNS record to this IP address instead, but you will still need to use the IP address to set up the shared IP (machine.network.interfaces[].vip.ip) inside the Talos configuration.
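
As a sketch, the corresponding fragment of the machine configuration (using the machine.network.interfaces[].vip.ip path mentioned above, with the example IP) would look something like:

machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
        vip:
          ip: 192.168.0.15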

For more information about using a shared IP, see the related Guide

DNS records

If neither of the other methods work for you, you can instead use DNS records to provide a measure of redundancy. In this case, you would add multiple A or AAAA records for a DNS name.

For instance, you could add:

kube.cluster1.mydomain.com  IN  A  192.168.0.10
kube.cluster1.mydomain.com  IN  A  192.168.0.11
kube.cluster1.mydomain.com  IN  A  192.168.0.12

Then, your endpoint would be:

https://kube.cluster1.mydomain.com:6443

Decide how to access the Talos API

Since Talos is entirely API-driven, it is important to know how you are going to access that API. Talos comes with a number of mechanisms to make that easier.

Controlplane nodes can proxy requests for worker nodes. This means that you only need access to the controlplane nodes in order to access the rest of the network. This is useful for security (your worker nodes do not need to have public IPs or be otherwise connected to the Internet), and it also makes working with highly-variable clusters easier, since you only need to know the controlplane nodes in advance.

Even better, the talosctl tool will automatically load balance and fail over between all of your controlplane nodes, so long as it is informed of each of the controlplane node IPs.

That does, of course, present the problem that you need to know how to talk to the controlplane nodes. In some environments, it is easy to be able to forecast, prescribe, or discover the controlplane node IP addresses. For others, though, even the controlplane nodes are dynamic, unpredictable, and undiscoverable.

The dynamic options above for the Kubernetes API endpoint also apply to the Talos API endpoints. The difference is that the Talos API runs on port 50000/tcp.

Whichever way you wish to access the Talos API, be sure to note the IP(s) or hostname(s) so that you can configure your talosctl tool’s endpoints below.

Configure Talos

When Talos boots without a configuration, such as when using the Talos ISO, it enters a limited maintenance mode and waits for a configuration to be provided.

Alternatively, the Talos installer can be booted with the talos.config kernel command line argument set to an HTTP(S) URL from which it should receive its configuration. In cases where a PXE server is available, this is much more efficient than manually configuring each node. If you do use this method, note that Talos does require a number of other kernel command line parameters. See the required kernel parameters for more information.

In either case, we need to generate the configuration which is to be provided. Luckily, the talosctl tool comes with a configuration generator for exactly this purpose.

  talosctl gen config "cluster-name" "cluster-endpoint"

Here, cluster-name is an arbitrary name for the cluster which will be used in your local client configuration as a label. It does not affect anything in the cluster itself, but it should be unique in the configuration on your local workstation.

The cluster-endpoint is where you insert the Kubernetes Endpoint you selected from above. This is the Kubernetes API URL, and it should be a complete URL, with https:// and port, if not 443. The default port is 6443, so the port is almost always required.
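
For example, using the shared IP endpoint chosen earlier (the cluster name here is arbitrary):

  talosctl gen config my-cluster https://192.168.0.15:6443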

When you run this command, you will receive a number of files in your current directory:

  • controlplane.yaml
  • worker.yaml
  • talosconfig

The first two .yaml files are what we call Machine Configs. They are installed onto the Talos servers to act as their complete configuration, describing everything from what disk Talos should be installed to, to what sysctls to set, to what network settings it should have. In the case of the controlplane.yaml, it even describes how Talos should form its Kubernetes cluster.

The talosconfig file (which is also YAML) is your local client configuration file.

Controlplane, Init, and Worker

The three types of Machine Configs correspond to the three roles of Talos nodes. For our purposes, you can ignore the Init type. It is a legacy type which will go away eventually. Its purpose was to self-bootstrap. Instead, we now use an API call to bootstrap the cluster, which is much more robust.

That leaves us with Controlplane and Worker.

The Controlplane Machine Config describes the configuration of a Talos server on which the Kubernetes Controlplane should run. The Worker Machine Config describes everything else: workload servers.

The main difference between Controlplane Machine Config files and Worker Machine Config files is that the former contains information about how to form the Kubernetes cluster.

Templates

The generated files can be thought of as templates. Individual machines may need specific settings (for instance, each may have a different static IP address). When different files are needed for machines of the same type, simply copy the source template (controlplane.yaml or worker.yaml) and make whatever modifications need to be done.

For instance, if you had three controlplane nodes and three worker nodes, you may do something like this:

  for i in $(seq 0 2); do
    cp controlplane.yaml cp$i.yaml
  done
  for i in $(seq 0 2); do
    cp worker.yaml w$i.yaml
  done

In cases where there is no special configuration needed, you may use the same file for each machine of the same type.

Apply Configuration

After you have generated each machine’s Machine Config, you need to load them into the machines themselves. For that, you need to know their IP addresses.

If you have access to the console or console logs of the machines, you can read them to find the IP address(es). Talos will print them out during the boot process:

[    4.605369] [talos] task loadConfig (1/1): this machine is reachable at:
[    4.607358] [talos] task loadConfig (1/1):   192.168.0.2
[    4.608766] [talos] task loadConfig (1/1): server certificate fingerprint:
[    4.611106] [talos] task loadConfig (1/1):   xA9a1t2dMxB0NJ0qH1pDzilWbA3+DK/DjVbFaJBYheE=
[    4.613822] [talos] task loadConfig (1/1):
[    4.614985] [talos] task loadConfig (1/1): upload configuration using talosctl:
[    4.616978] [talos] task loadConfig (1/1):   talosctl apply-config --insecure --nodes 192.168.0.2 --file <config.yaml>
[    4.620168] [talos] task loadConfig (1/1): or apply configuration using talosctl interactive installer:
[    4.623046] [talos] task loadConfig (1/1):   talosctl apply-config --insecure --nodes 192.168.0.2 --interactive
[    4.626365] [talos] task loadConfig (1/1): optionally with node fingerprint check:
[    4.628692] [talos] task loadConfig (1/1):   talosctl apply-config --insecure --nodes 192.168.0.2 --cert-fingerprint 'xA9a1t2dMxB0NJ0qH1pDzilWbA3+DK/DjVbFaJBYheE=' --file <config.yaml>

If you do not have console access, the IP address may also be discoverable from your DHCP server.

Once you have the IP address, you can then apply the correct configuration.

  talosctl apply-config --insecure \
    --nodes 192.168.0.2 \
    --file cp0.yaml

The --insecure flag is necessary at this point because the PKI infrastructure has not yet been made available to the node. Note that the connection will still be encrypted; it is simply unauthenticated.

If you have console access, though, you can extract the server certificate fingerprint and use it for an additional layer of validation:

  talosctl apply-config --insecure \
    --nodes 192.168.0.2 \
    --cert-fingerprint xA9a1t2dMxB0NJ0qH1pDzilWbA3+DK/DjVbFaJBYheE= \
    --file cp0.yaml

Using the fingerprint allows you to be sure you are sending the configuration to the right machine, but it is completely optional.

After the configuration is applied to a node, it will reboot.

You may repeat this process for each of the nodes in your cluster.

Configure your talosctl client

Now that the nodes are running Talos with its full PKI security suite, you need to use that PKI to talk to the machines. That means configuring your client, and that is what that talosconfig file is for.

Endpoints

Endpoints are the communication endpoints to which the client directly talks. These can be load balancers, DNS hostnames, a list of IPs, etc. In general, it is recommended that these point to the set of control plane nodes, either directly or through a reverse proxy or load balancer.

Each endpoint will automatically proxy requests destined to another node through it, so it is not necessary to change the endpoint configuration just because you wish to talk to a different node within the cluster.

Endpoints do, however, need to be members of the same Talos cluster as the target node, because these proxied connections rely on certificate-based authentication.

We need to set the endpoints in your talosconfig. talosctl will automatically load balance and fail over among the endpoints, so no external load balancer or DNS abstraction is required (though you are free to use them, if desired).

As an example, if the IP addresses of our controlplane nodes are:

  • 192.168.0.2
  • 192.168.0.3
  • 192.168.0.4

We would set those in the talosconfig with:

  talosctl --talosconfig=./talosconfig \
    config endpoint 192.168.0.2 192.168.0.3 192.168.0.4

Nodes

The node is the target node on which you wish to perform the API call.

Keep in mind, when specifying nodes, that their IPs and/or hostnames are as seen by the endpoint servers, not by the client. This is because all connections are proxied first through the endpoints.

Some people also like to set a default set of nodes in the talosconfig. This can be done in the same manner, replacing endpoint with node. If you do this, however, know that you could easily reboot the wrong machine by forgetting to declare the right one explicitly. Worse, if you set several nodes as defaults, a single talosctl upgrade command could upgrade your whole cluster at the same time. It’s a powerful tool, and with that comes great responsibility.

The author of this document generally sets a single controlplane node to be the default node, which provides the most flexible default operation while limiting the scope of the disaster should a command be entered erroneously:

  talosctl --talosconfig=./talosconfig \
    config node 192.168.0.2

You may simply provide -n or --nodes to any talosctl command to supply the node or (comma-delimited) nodes on which you wish to perform the operation. Supplying the commandline parameter will override any default nodes in the configuration file.
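
For example, to check the version of a specific node, overriding any default:

  talosctl --nodes 192.168.0.3 version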

To verify default node(s) you’re currently configured to use, you can run:

$ talosctl version
Client:
        ...
Server:
        NODE:        <node>
        ...

For a more in-depth discussion of Endpoints and Nodes, please see talosctl.

Default configuration file

You can reference which configuration file to use directly with the --talosconfig parameter:

  talosctl --talosconfig=./talosconfig \
    --nodes 192.168.0.2 version

However, talosctl comes with tooling to help you integrate and merge this configuration into the default talosctl configuration file. This is done with the merge option.

  talosctl config merge ./talosconfig

This will merge your new talosconfig into the default configuration file ($XDG_CONFIG_HOME/talos/config.yaml), creating it if necessary. Like Kubernetes, the talosconfig configuration file has multiple “contexts” which correspond to multiple clusters. The <cluster-name> you chose above will be used as the context name.
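
As an example (assuming these config subcommands are available in your talosctl version), you can list the merged contexts and switch between them with:

  talosctl config contexts
  talosctl config context <cluster-name>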

Kubernetes Bootstrap

All of your machines are configured, and your talosctl client is set up. Now, you are ready to bootstrap your Kubernetes cluster. If that sounds daunting, you haven’t used Talos before.

Bootstrapping your Kubernetes cluster with Talos is as simple as:

  talosctl bootstrap --nodes 192.168.0.2

IMPORTANT: the bootstrap operation should only be called ONCE and only on a SINGLE controlplane node!

The IP there can be any of your controlplanes (or the loadbalancer, if you have one). It should only be issued once.

At this point, Talos will form an etcd cluster, generate all of the core Kubernetes assets, and start the Kubernetes controlplane components.

After a few moments, you will be able to download your Kubernetes client configuration and get started:

  talosctl kubeconfig

Running this command will add (merge) your new cluster into your local Kubernetes configuration in the same way as talosctl config merge merged the Talos client configuration into your local Talos client configuration file.

If you would prefer for the configuration to not be merged into your default Kubernetes configuration file, simply give it a filename:

  talosctl kubeconfig alternative-kubeconfig

If all goes well, you should now be able to connect to Kubernetes and see your nodes:

  kubectl get nodes

1.4 - System Requirements

Minimum Requirements

Role                 Memory   Cores
Init/Control Plane   2GB      2
Worker               1GB      1

Recommended

Role                 Memory   Cores
Init/Control Plane   4GB      4
Worker               2GB      2

These requirements are similar to those of Kubernetes itself.

1.5 - What's New in Talos 0.14

Kubelet

Kubelet configuration can be updated without node restart (.machine.kubelet section of machine configuration) with commands talosctl edit mc --immediate, talosctl apply-config --immediate, talosctl patch mc --immediate.

Kubelet service can now be restarted with talosctl service kubelet restart.

Kubelet node IP configuration (.machine.kubelet.nodeIP.validSubnets) can now include negative subnet matches (prefixed with !).
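
As a sketch, the field layout follows the .machine.kubelet.nodeIP.validSubnets path mentioned above (subnet values are illustrative):

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.0.0.0/8
        - '!10.0.0.5/32'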

Kubernetes Upgrade Enhancements

talosctl upgrade-k8s was improved to:

  • sync all bootstrap manifest resources in the Kubernetes cluster with the versions bundled with the current Talos version
  • upgrade kubelet to the version of the control plane components (without node reboot)

So there is no longer a need to manually update the CoreDNS and Flannel containers after running upgrade-k8s.

Log Shipping

Talos can now ship system logs to the configured destination using either JSON-over-UDP or JSON-over-TCP: see .machine.logging machine configuration option.
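
As a sketch, a fragment shipping logs over TCP might look like this (the layout under .machine.logging and the json_lines format name are assumptions; the destination address is illustrative):

machine:
  logging:
    destinations:
      - endpoint: tcp://10.0.0.1:4001
        format: json_lines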

NTP Sync

The Talos NTP sync process was improved to align better with kernel time adjustment periods and to filter out spikes.

talosctl support

The talosctl CLI tool now has a new subcommand, support, that gathers cluster information that could help with debugging.

The output of the command is a zip archive with all Talos service logs, Kubernetes pod logs and manifests, Talos resource manifests, and so on. The generated archive does not contain any secret information, so it is safe to send it to a third party for analysis.
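
For example (the node IP is illustrative):

talosctl -n 192.168.0.2 support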

Component Updates

  • Linux: 5.15.6
  • etcd: 3.5.1
  • containerd: 1.5.8
  • runc: 1.0.3
  • Kubernetes: 1.23.1
  • CoreDNS: 1.8.6
  • Flannel (default CNI): 0.15.1

Talos is built with Go 1.17.5

Cluster Discovery

Cluster Discovery is enabled by default for Talos 0.14. Cluster Discovery can be disabled with talosctl gen config --with-cluster-discovery=false.

Kexec and capabilities

When kexec support is disabled, Talos no longer drops Linux capabilities (CAP_SYS_BOOT and CAP_SYS_MODULES) for child processes. This is helpful for advanced use cases like Docker-in-Docker.

If you want to permanently disable kexec and capabilities dropping, pass kexec_load_disabled=1 argument to the kernel.

For example:

install:
  extraKernelArgs:
    - sysctl.kernel.kexec_load_disabled=1

Please note that capabilities are dropped before machine configuration is loaded, so disabling kexec via machine.sysctls will not be enough.

installer and imager images

Talos supports two target architectures: amd64 and arm64, so all Talos images are built for both amd64 and arm64.

A new imager image was added which contains Talos assets for both architectures, allowing Talos disk images to be generated cross-architecture: e.g. generating a Talos Raspberry Pi disk image on an amd64 machine.

As the installer image is used only for the initial install and upgrades, it now contains Talos assets for a single architecture. This reduces the size of the installer image, leading to faster upgrades and lower memory usage.

There are no user-visible changes, except that the imager container image should now be used to produce Talos disk images.

A set of Talos enhancements is going to unlock a number of exciting features in the upcoming release of Sidero:

  • SideroLink: a point-to-point Wireguard tunnel connecting Talos node back to the provisioning platform (Sidero).
  • event sink (kernel arg talos.event.sink=http://10.0.0.1:4000) delivers Talos internal events to the specified destination.
  • kmsg log delivery (kernel arg talos.logging.kernel=tcp://10.0.0.1:4001) sends kernel logs as JSON lines over TCP or UDP.

VLAN Enhancements

Talos now supports setting MTU and Virtual IPs on VLAN interfaces.

1.6 - Support Matrix

Talos Version                        0.14                               0.13
Release Date                         2021-12-21                         2021-10-11 (0.13.0)
End of Community Support             1.0.0 release (2022-03-27, TBD)    0.14.0 release (2021-12-21)
Enterprise Support                   offered by Sidero Labs Inc.
Kubernetes                           1.23, 1.22, 1.21                   1.22, 1.21, 1.20
Architecture                         amd64, arm64
Platforms
  - cloud                            AWS, GCP, Azure, Digital Ocean, Hetzner, OpenStack, Scaleway, Vultr, Upcloud
  - bare metal                       x86: BIOS, UEFI; arm64: UEFI; boot: ISO, PXE, disk image
  - virtualized                      VMware, Hyper-V, KVM, Proxmox, Xen
  - SBCs                             Raspberry Pi 4, Banana Pi M64, Pine64, and others
  - local                            Docker, QEMU
Cluster API
CAPI Bootstrap Provider Talos        >= 0.4.3                           >= 0.3.0
CAPI Control Plane Provider Talos    >= 0.4.1                           >= 0.1.1
Sidero                               >= 0.4.1                           >= 0.3.0
UI
Theila

Platform Tiers

Tier 1: Automated tests, high-priority fixes.
Tier 2: Tested from time to time, medium-priority bugfixes.
Tier 3: Not tested by core Talos team, community tested.

Tier 1

  • Metal
  • AWS
  • GCP

Tier 2

  • Azure
  • Digital Ocean
  • OpenStack
  • VMware

Tier 3

  • Hetzner
  • nocloud
  • Scaleway
  • Vultr
  • Upcloud

2 - Bare Metal Platforms

2.1 - Digital Rebar

In this guide we will create a Kubernetes cluster with 1 worker node and 2 controlplane nodes using an existing Digital Rebar deployment.

Prerequisites

Creating a Cluster

In this guide we will create a Kubernetes cluster with 1 worker node and 2 controlplane nodes. We assume an existing Digital Rebar deployment, and some familiarity with iPXE.

We leave it up to the user to decide if they would like to use static networking, or DHCP. The setup and configuration of DHCP will not be covered.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the load balancer, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-metal-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

The load balancer is used to distribute the load across multiple controlplane nodes. This isn’t covered in detail, because we assume some load balancing knowledge beforehand. If you think this should be added to the docs, please create an issue.

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode metal
controlplane.yaml is valid for metal mode
$ talosctl validate --config worker.yaml --mode metal
worker.yaml is valid for metal mode

Publishing the Machine Configuration Files

Digital Rebar has a built-in file server, which means we can use this feature to expose the Talos configuration files. We will place controlplane.yaml and worker.yaml into the Digital Rebar file server by using the drpcli tools.

Copy the generated files from the step above into your Digital Rebar installation.

drpcli file upload <file>.yaml as <file>.yaml

Replacing <file> with controlplane or worker.

Download the boot files

Download a recent version of boot.tar.gz from github.

Upload to DRP:

$ drpcli isos upload boot.tar.gz as talos.tar.gz
{
  "Path": "talos.tar.gz",
  "Size": 96470072
}

We have some Digital Rebar example files in the Git repo you can use to provision Digital Rebar with drpcli.

To apply these configs you need to create them, and then apply them as follows:

$ drpcli bootenvs create talos
{
  "Available": true,
  "BootParams": "",
  "Bundle": "",
  "Description": "",
  "Documentation": "",
  "Endpoint": "",
  "Errors": [],
  "Initrds": [],
  "Kernel": "",
  "Meta": {},
  "Name": "talos",
  "OS": {
    "Codename": "",
    "Family": "",
    "IsoFile": "",
    "IsoSha256": "",
    "IsoUrl": "",
    "Name": "",
    "SupportedArchitectures": {},
    "Version": ""
  },
  "OnlyUnknown": false,
  "OptionalParams": [],
  "ReadOnly": false,
  "RequiredParams": [],
  "Templates": [],
  "Validated": true
}
drpcli bootenvs update talos - < bootenv.yaml

You need to do this for all files in the example directory. If you don’t have access to the drpcli tools, you can also use the web interface.

It’s important to have a corresponding SHA256 hash that matches the boot.tar.gz you uploaded.
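
You can compute the hash locally, for example:

sha256sum boot.tar.gz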

Bootenv BootParams

We’re using some of Digital Rebar’s built-in templating to make sure the machine gets the correct role assigned.

talos.platform=metal talos.config={{ .ProvisionerURL }}/files/{{.Param \"talos/role\"}}.yaml

This is why we also include a params.yaml in the example directory to make sure the role is set to one of the following:

  • controlplane
  • worker

The {{.Param \"talos/role\"}} then gets populated with one of the above roles.

Boot the Machines

In the UI of Digital Rebar you need to select the machines you want to provision. Once selected, you need to assign the following:

  • Profile
  • Workflow

This will provision the Stage and Bootenv with the talos values. Once this is done, you can boot the machine.

To understand the boot process, we have a higher level overview located at metal overview.

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP:

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

2.2 - Equinix Metal

Creating Talos cluster using Equinix Metal.

Prerequisites

This guide assumes the user has a working API token, the Equinix Metal CLI installed, and some familiarity with the CLI.

Network Booting

To install Talos to a server a working TFTP and iPXE server are needed. How this is done varies and is left as an exercise for the user. In general this requires a Talos kernel vmlinuz and initramfs. These assets can be downloaded from a given release.

Special Considerations

PXE Boot Kernel Parameters

The following is a list of kernel parameters required by Talos:

  • talos.platform: set this to packet
  • init_on_alloc=1: required by KSPP
  • slab_nomerge: required by KSPP
  • pti=on: required by KSPP

User Data

To configure Talos you can use the metadata service provided by Equinix Metal. It is required to add a shebang to the top of the configuration file. The shebang is arbitrary in the case of Talos, and the convention we use is #!talos.
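
For example, after generating the machine configs (see below), the shebang could be prepended with a one-liner (GNU sed syntax assumed):

sed -i '1i #!talos' controlplane.yaml
sed -i '1i #!talos' worker.yaml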

Creating a Cluster via the Equinix Metal CLI

Control Plane Endpoint

The strategy used for an HA cluster varies and is left as an exercise for the user. Some of the known ways are:

  • DNS
  • Load Balancer
  • BGP

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-aws-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

Now add the required shebang (e.g. #!talos) at the top of controlplane.yaml and worker.yaml. At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with an RFC 6902 JSON patch which will be applied during the config generation.

Validate the Configuration Files

talosctl validate --config controlplane.yaml --mode metal
talosctl validate --config worker.yaml --mode metal

Note: Validation of the install disk could potentially fail, as the validation is performed on your local machine and the specified disk may not exist.

Create the Control Plane Nodes

metal device create \
  --project-id $PROJECT_ID \
  --facility $FACILITY \
  --ipxe-script-url $PXE_SERVER \
  --operating-system "custom_ipxe" \
  --plan $PLAN \
  --hostname $HOSTNAME \
  --userdata-file controlplane.yaml

Note: The above should be invoked at least twice in order for etcd to form quorum.

Create the Worker Nodes

metal device create \
  --project-id $PROJECT_ID \
  --facility $FACILITY \
  --ipxe-script-url $PXE_SERVER \
  --operating-system "custom_ipxe" \
  --plan $PLAN \
  --hostname $HOSTNAME \
  --userdata-file worker.yaml

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

2.3 - Matchbox

In this guide we will create an HA Kubernetes cluster with 3 worker nodes using an existing load balancer and matchbox deployment.

Creating a Cluster

In this guide we will create an HA Kubernetes cluster with 3 worker nodes. We assume an existing load balancer, matchbox deployment, and some familiarity with iPXE.

We leave it up to the user to decide if they would like to use static networking, or DHCP. The setup and configuration of DHCP will not be covered.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the load balancer, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-metal-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode metal
controlplane.yaml is valid for metal mode
$ talosctl validate --config worker.yaml --mode metal
worker.yaml is valid for metal mode

Publishing the Machine Configuration Files

In bare-metal setups it is up to the user to provide the configuration files over HTTP(S). A special kernel parameter (talos.config) must be used to inform Talos about where it should retrieve its configuration file. To keep things simple we will place controlplane.yaml and worker.yaml into Matchbox’s assets directory. This directory is automatically served by Matchbox.

Create the Matchbox Configuration Files

The profiles we will create will reference vmlinuz, and initramfs.xz. Download these files from the release of your choice, and place them in /var/lib/matchbox/assets.

Profiles

Control Plane Nodes
{
  "id": "control-plane",
  "name": "control-plane",
  "boot": {
    "kernel": "/assets/vmlinuz",
    "initrd": ["/assets/initramfs.xz"],
    "args": [
      "initrd=initramfs.xz",
      "init_on_alloc=1",
      "slab_nomerge",
      "pti=on",
      "console=tty0",
      "console=ttyS0",
      "printk.devkmsg=on",
      "talos.platform=metal",
      "talos.config=http://matchbox.talos.dev/assets/controlplane.yaml"
    ]
  }
}

Note: Be sure to change http://matchbox.talos.dev to the endpoint of your matchbox server.

Worker Nodes
{
  "id": "default",
  "name": "default",
  "boot": {
    "kernel": "/assets/vmlinuz",
    "initrd": ["/assets/initramfs.xz"],
    "args": [
      "initrd=initramfs.xz",
      "init_on_alloc=1",
      "slab_nomerge",
      "pti=on",
      "console=tty0",
      "console=ttyS0",
      "printk.devkmsg=on",
      "talos.platform=metal",
      "talos.config=http://matchbox.talos.dev/assets/worker.yaml"
    ]
  }
}

Groups

Now, create the following groups, and ensure that the selectors are accurate for your specific setup.

{
  "id": "control-plane-1",
  "name": "control-plane-1",
  "profile": "control-plane",
  "selector": {
    ...
  }
}
{
  "id": "control-plane-2",
  "name": "control-plane-2",
  "profile": "control-plane",
  "selector": {
    ...
  }
}
{
  "id": "control-plane-3",
  "name": "control-plane-3",
  "profile": "control-plane",
  "selector": {
    ...
  }
}
{
  "id": "default",
  "name": "default",
  "profile": "default"
}

Boot the Machines

Now that we have our configuration files in place, boot all the machines. Talos will come up on each machine, grab its configuration file, and bootstrap itself.

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

2.4 - Sidero

Sidero is a project created by the Talos team that has native support for Talos.

Sidero is a project created by the Talos team that has native support for Talos. The best way to get started with Sidero is to visit the website.

3 - Virtualized Platforms

3.1 - Hyper-V

Talos is known to work on Hyper-V; however, it is currently undocumented.

3.2 - KVM

Talos is known to work on KVM; however, it is currently undocumented.

3.3 - Proxmox

Creating Talos Kubernetes cluster using Proxmox.

In this guide we will create a Kubernetes cluster using Proxmox.

Video Walkthrough

To see a live demo of this writeup, visit Youtube here:

Installation

How to Get Proxmox

It is assumed that you have already installed Proxmox onto the server you wish to create Talos VMs on. Visit the Proxmox downloads page if necessary.

Install talosctl

You can download talosctl via github.com/talos-systems/talos/releases

curl https://github.com/siderolabs/talos/releases/download/<version>/talosctl-<platform>-<arch> -L -o talosctl

For example version v0.14.0 for linux platform:

curl https://github.com/talos-systems/talos/releases/latest/download/talosctl-linux-amd64 -L -o talosctl
sudo cp talosctl /usr/local/bin
sudo chmod +x /usr/local/bin/talosctl

Download ISO Image

In order to install Talos in Proxmox, you will need the ISO image from the Talos release page. You can download talos-amd64.iso via github.com/talos-systems/talos/releases

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/<version>/talos-<arch>.iso -L -o _out/talos-<arch>.iso

For example version v0.14.0 for linux platform:

mkdir -p _out/
curl https://github.com/talos-systems/talos/releases/latest/download/talos-amd64.iso -L -o _out/talos-amd64.iso

Upload ISO

From the Proxmox UI, select the “local” storage and enter the “Content” section. Click the “Upload” button:

Select the ISO you downloaded previously, then hit “Upload”

Create VMs

Start by creating a new VM by clicking the “Create VM” button in the Proxmox UI:

Fill out a name for the new VM:

In the OS tab, select the ISO we uploaded earlier:

Keep the defaults set in the “System” tab.

Keep the defaults in the “Hard Disk” tab as well, only changing the size if desired.

In the “CPU” section, give at least 2 cores to the VM:

Verify that the RAM is set to at least 2GB:

Keep the default values for networking, verifying that the VM is set to come up on the bridge interface:

Finish creating the VM by clicking through the “Confirm” tab and then “Finish”.

Repeat this process for a second VM to use as a worker node. You can also repeat this for additional nodes desired.

Start Control Plane Node

Once the VMs have been created and updated, start the VM that will be the first control plane node. This VM will boot the ISO image specified earlier and enter “maintenance mode”.

With DHCP server

Once the machine has entered maintenance mode, there will be a console log that details the IP address that the node received. Take note of this IP address, which will be referred to as $CONTROL_PLANE_IP for the rest of this guide. If you wish to export this IP as a bash variable, simply issue a command like export CONTROL_PLANE_IP=1.2.3.4.

Without DHCP server

To apply the machine configuration while the VM is in maintenance mode, it has to have an IP address on the network, so you can set one manually at boot time.

Press e at the boot menu and set the IP parameters for the VM. The format is:

ip=<client-ip>:<srv-ip>:<gw-ip>:<netmask>:<host>:<device>:<autoconf>

For example, if $CONTROL_PLANE_IP is to be 192.168.0.100 and the gateway is 192.168.0.1:

linux /boot/vmlinuz init_on_alloc=1 slab_nomerge pti=on panic=0 consoleblank=0 printk.devkmsg=on earlyprintk=ttyS0 console=tty0 console=ttyS0 talos.platform=metal ip=192.168.0.100::192.168.0.1:255.255.255.0::eth0:off

Then press Ctrl-x or F10

Generate Machine Configurations

With the IP address above, you can now generate the machine configurations to use for installing Talos and Kubernetes. Issue the following command, updating the output directory, cluster name, and control plane IP as you see fit:

talosctl gen config talos-vbox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir _out

This will create several files in the _out directory: controlplane.yaml, worker.yaml, and talosconfig.

Create Control Plane Node

Using the controlplane.yaml generated above, you can now apply this config using talosctl. Issue:

talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file _out/controlplane.yaml

You should now see some action in the Proxmox console for this VM. Talos will be installed to disk, the VM will reboot, and then Talos will configure the Kubernetes control plane on this VM.

Note: This process can be repeated multiple times to create an HA control plane.

Create Worker Node

Create at least a single worker node using a process similar to the control plane creation above. Start the worker node VM and wait for it to enter “maintenance mode”. Take note of the worker node’s IP address, which will be referred to as $WORKER_IP

Issue:

talosctl apply-config --insecure --nodes $WORKER_IP --file _out/worker.yaml

Note: This process can be repeated multiple times to add additional workers.

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace. To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.

First, configure talosctl to talk to your control plane node by issuing the following, updating paths and IPs as necessary:

export TALOSCONFIG="_out/talosconfig"
talosctl config endpoint $CONTROL_PLANE_IP
talosctl config node $CONTROL_PLANE_IP

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig _out/talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig _out/talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig _out/talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig _out/talosconfig kubeconfig .

Cleaning Up

To cleanup, simply stop and delete the virtual machines from the Proxmox UI.

3.4 - VMware

Creating Talos Kubernetes cluster using VMware.

Creating a Cluster via the govc CLI

In this guide we will create an HA Kubernetes cluster with 2 worker nodes. We will use the govc cli which can be downloaded here.

Prereqs/Assumptions

This guide will use the virtual IP (“VIP”) functionality that is built into Talos in order to provide a stable, known IP for the Kubernetes control plane. This simply means the user should pick an IP on their “VM Network” to designate for this purpose and keep it handy for future steps.

Create the Machine Configuration Files

Generating Base Configurations

Using the VIP chosen in the prereq steps, we will now generate the base configuration files for the Talos machines. This can be done with the talosctl gen config ... command. Take note that we will also use a JSON6902 patch when creating the configs so that the control plane nodes get some special information about the VIP we chose earlier, as well as a daemonset to install vmware tools on talos nodes.

First, download the cp.patch to your local machine and edit the VIP to match your chosen IP. You can do this by issuing curl -fsSLO "https://raw.githubusercontent.com/talos-systems/talos/master/website/content/docs/v1.0/Virtualized%20Platforms/vmware/cp.patch". Its contents should look like the following:

- op: add
  path: /machine/network
  value:
    interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: <VIP>
- op: replace
  path: /cluster/extraManifests
  value:
    - "https://raw.githubusercontent.com/mologie/talos-vmtoolsd/master/deploy/unstable.yaml"

With the patch in hand, generate machine configs with:

$ talosctl gen config vmware-test https://<VIP>:<port> --config-patch-control-plane "$(yq r -j cp.patch)"
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking if needed. Optionally, you can specify additional patches by adding to the cp.patch file downloaded earlier, or create your own patch files.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Set Environment Variables

govc makes use of the following environment variables

export GOVC_URL=<vCenter url>
export GOVC_USERNAME=<vCenter username>
export GOVC_PASSWORD=<vCenter password>

Note: If your vCenter installation makes use of self signed certificates, you’ll want to export GOVC_INSECURE=true.

There are some additional variables that you may need to set:

export GOVC_DATACENTER=<vCenter datacenter>
export GOVC_RESOURCE_POOL=<vCenter resource pool>
export GOVC_DATASTORE=<vCenter datastore>
export GOVC_NETWORK=<vCenter network>

Choose Install Approach

As part of this guide, we have a more automated install script that handles some of the complexity of importing OVAs and creating VMs. If you wish to use this script, we will detail that next. If you wish to carry out the manual approach, simply skip ahead to the “Manual Approach” section.

Scripted Install

Download the vmware.sh script to your local machine. You can do this by issuing curl -fsSLO "https://raw.githubusercontent.com/talos-systems/talos/master/website/content/docs/v1.0/Virtualized%20Platforms/vmware/vmware.sh". This script has default variables for things like Talos version and cluster name that may be interesting to tweak before deploying.

Import OVA

To create a content library and import the Talos OVA corresponding to the mentioned Talos version, simply issue:

./vmware.sh upload_ova

Create Cluster

With the OVA uploaded to the content library, you can create a 5 node (by default) cluster with 3 control plane and 2 worker nodes:

./vmware.sh create

This step will create a VM from the OVA, edit the settings based on the env variables used for VM size/specs, then power on the VMs.

You may now skip past the “Manual Approach” section down to “Bootstrap Cluster”.

Manual Approach

Import the OVA into vCenter

A talos.ova asset is published with each release. We will refer to the version of the release as $TALOS_VERSION below. It can be easily exported with export TALOS_VERSION="v0.3.0-alpha.10" or similar.

curl -LO https://github.com/siderolabs/talos/releases/download/$TALOS_VERSION/talos.ova

Create a content library (if needed) with:

govc library.create <library name>

Import the OVA to the library with:

govc library.import -n talos-${TALOS_VERSION} <library name> /path/to/downloaded/talos.ova

Create the Bootstrap Node

We’ll clone the OVA to create the bootstrap node (our first control plane node).

govc library.deploy <library name>/talos-${TALOS_VERSION} control-plane-1

Talos makes use of the guestinfo facility of VMware to provide the machine/cluster configuration. This can be set using the govc vm.change command. To facilitate persistent storage using the vSphere cloud provider integration with Kubernetes, disk.enableUUID=1 is used.

govc vm.change \
  -e "guestinfo.talos.config=$(cat controlplane.yaml | base64)" \
  -e "disk.enableUUID=1" \
  -vm control-plane-1

Update Hardware Resources for the Bootstrap Node

  • -c is used to configure the number of cpus
  • -m is used to configure the amount of memory (in MB)
govc vm.change \
  -c 2 \
  -m 4096 \
  -vm control-plane-1

The following can be used to adjust the ephemeral disk size.

govc vm.disk.change -vm control-plane-1 -disk.name disk-1000-0 -size 10G
govc vm.power -on control-plane-1

Create the Remaining Control Plane Nodes

govc library.deploy <library name>/talos-${TALOS_VERSION} control-plane-2
govc vm.change \
  -e "guestinfo.talos.config=$(base64 controlplane.yaml)" \
  -e "disk.enableUUID=1" \
  -vm control-plane-2

govc library.deploy <library name>/talos-${TALOS_VERSION} control-plane-3
govc vm.change \
  -e "guestinfo.talos.config=$(base64 controlplane.yaml)" \
  -e "disk.enableUUID=1" \
  -vm control-plane-3
govc vm.change \
  -c 2 \
  -m 4096 \
  -vm control-plane-2

govc vm.change \
  -c 2 \
  -m 4096 \
  -vm control-plane-3
govc vm.disk.change -vm control-plane-2 -disk.name disk-1000-0 -size 10G

govc vm.disk.change -vm control-plane-3 -disk.name disk-1000-0 -size 10G
govc vm.power -on control-plane-2

govc vm.power -on control-plane-3

Update Settings for the Worker Nodes

govc library.deploy <library name>/talos-${TALOS_VERSION} worker-1
govc vm.change \
  -e "guestinfo.talos.config=$(base64 worker.yaml)" \
  -e "disk.enableUUID=1" \
  -vm worker-1

govc library.deploy <library name>/talos-${TALOS_VERSION} worker-2
govc vm.change \
  -e "guestinfo.talos.config=$(base64 worker.yaml)" \
  -e "disk.enableUUID=1" \
  -vm worker-2
govc vm.change \
  -c 4 \
  -m 8192 \
  -vm worker-1

govc vm.change \
  -c 4 \
  -m 8192 \
  -vm worker-2
govc vm.disk.change -vm worker-1 -disk.name disk-1000-0 -size 10G

govc vm.disk.change -vm worker-2 -disk.name disk-1000-0 -size 10G
govc vm.power -on worker-1

govc vm.power -on worker-2

Bootstrap Cluster

In the vSphere UI, open a console to one of the control plane nodes. You should see some output stating that etcd should be bootstrapped. This text should look like:

"etcd is waiting to join the cluster, if this node is the first node in the cluster, please run `talosctl bootstrap` against one of the following IPs:

Take note of the IP mentioned here and issue:

talosctl --talosconfig talosconfig bootstrap -e <control plane IP> -n <control plane IP>

Keep this IP handy for the following steps as well.

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig config endpoint <control plane IP>
talosctl --talosconfig talosconfig config node <control plane IP>
talosctl --talosconfig talosconfig kubeconfig .

Configure talos-vmtoolsd

The talos-vmtoolsd application was deployed as a daemonset as part of the cluster creation; however, we must now provide a talos credentials file for it to use.

Create a new talosconfig with:

talosctl -n <control plane IP> config new vmtoolsd-secret.yaml --roles os:admin

Create a secret from the talosconfig:

kubectl -n kube-system create secret generic talos-vmtoolsd-config \
  --from-file=talosconfig=./vmtoolsd-secret.yaml

Clean up the generated file from local system:

rm vmtoolsd-secret.yaml

Once configured, you should now see these daemonset pods go into “Running” state and in vCenter, you will now see IPs and info from the Talos nodes present in the UI.

3.5 - Xen

Talos is known to work on Xen; however, it is currently undocumented.

4 - Cloud Platforms

4.1 - AWS

Creating a cluster via the AWS CLI.

Official AMI Images

Official AMI image ID can be found in the cloud-images.json file attached to the Talos release:

curl -sL https://github.com/siderolabs/talos/releases/download/v0.14.0/cloud-images.json | \
    jq -r '.[] | select(.region == "us-east-1") | select (.arch == "amd64") | .id'

Replace us-east-1 and amd64 in the line above with the desired region and architecture.

Creating a Cluster via the AWS CLI

In this guide we will create an HA Kubernetes cluster with 3 worker nodes. We assume an existing VPC, and some familiarity with AWS. If you need more information on AWS specifics, please see the official AWS documentation.

Create the Subnet

aws ec2 create-subnet \
    --region $REGION \
    --vpc-id $VPC \
    --cidr-block ${CIDR_BLOCK}

Create the AMI

Prepare the Import Prerequisites

Create the S3 Bucket
aws s3api create-bucket \
    --bucket $BUCKET \
    --create-bucket-configuration LocationConstraint=$REGION \
    --acl private
Create the vmimport Role

In order to create an AMI, ensure that the vmimport role exists as described in the official AWS documentation.

Note that the role should be associated with the S3 bucket we created above.

Create the Image Snapshot

First, download the AWS image from a Talos release:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/aws-amd64.tar.gz
tar -xvf aws-amd64.tar.gz

Copy the RAW disk to S3 and import it as a snapshot:

aws s3 cp disk.raw s3://$BUCKET/talos-aws-tutorial.raw
aws ec2 import-snapshot \
    --region $REGION \
    --description "Talos kubernetes tutorial" \
    --disk-container "Format=raw,UserBucket={S3Bucket=$BUCKET,S3Key=talos-aws-tutorial.raw}"

Save the SnapshotId, as we will need it once the import is done. To check on the status of the import, run:

aws ec2 describe-import-snapshot-tasks \
    --region $REGION \
    --import-task-ids

Once the SnapshotTaskDetail.Status indicates completed, we can register the image.
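
For example, the snapshot ID can be captured with a --query filter; here $IMPORT_TASK_ID is assumed to hold the ImportTaskId returned by the import command:

SNAPSHOT=$(aws ec2 describe-import-snapshot-tasks \
    --region $REGION \
    --import-task-ids $IMPORT_TASK_ID \
    --query 'ImportSnapshotTasks[0].SnapshotTaskDetail.SnapshotId' \
    --output text)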

Register the Image
aws ec2 register-image \
    --region $REGION \
    --block-device-mappings "DeviceName=/dev/xvda,VirtualName=talos,Ebs={DeleteOnTermination=true,SnapshotId=$SNAPSHOT,VolumeSize=4,VolumeType=gp2}" \
    --root-device-name /dev/xvda \
    --virtualization-type hvm \
    --architecture x86_64 \
    --ena-support \
    --name talos-aws-tutorial-ami

We now have an AMI we can use to create our cluster. Save the AMI ID, as we will need it when we create EC2 instances.

Create a Security Group

aws ec2 create-security-group \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --description "Security Group for EC2 instances to allow ports required by Talos"

Using the security group ID from above, allow all internal traffic within the same security group:

aws ec2 authorize-security-group-ingress \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --protocol all \
    --port 0 \
    --source-group $SECURITY_GROUP

and expose the Talos and Kubernetes APIs:

aws ec2 authorize-security-group-ingress \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --protocol tcp \
    --port 6443 \
    --cidr 0.0.0.0/0

aws ec2 authorize-security-group-ingress \
    --region $REGION \
    --group-name talos-aws-tutorial-sg \
    --protocol tcp \
    --port 50000-50001 \
    --cidr 0.0.0.0/0

Create a Load Balancer

aws elbv2 create-load-balancer \
    --region $REGION \
    --name talos-aws-tutorial-lb \
    --type network --subnets $SUBNET

Take note of the DNS name and ARN. We will need these soon.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-aws-tutorial https://<load balancer IP or DNS>:<port> --with-examples=false --with-docs=false
created controlplane.yaml
created worker.yaml
created talosconfig

Note that without the --with-examples=false and --with-docs=false flags, the generated configs would be too long for the AWS userdata field.

At this point, you can modify the generated configs to your liking.

Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the EC2 Instances

Note: There is a known issue that prevents Talos from running on T2 instance types. Please use T3 if you need burstable instance types.

Create the Control Plane Nodes

CP_COUNT=1
while [[ "$CP_COUNT" -lt 4 ]]; do
  aws ec2 run-instances \
    --region $REGION \
    --image-id $AMI \
    --count 1 \
    --instance-type t3.small \
    --user-data file://controlplane.yaml \
    --subnet-id $SUBNET \
    --security-group-ids $SECURITY_GROUP \
    --associate-public-ip-address \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=talos-aws-tutorial-cp-$CP_COUNT}]"
  ((CP_COUNT++))
done

Make a note of the resulting PrivateIpAddress of each controlplane node for later use.

Create the Worker Nodes

aws ec2 run-instances \
    --region $REGION \
    --image-id $AMI \
    --count 3 \
    --instance-type t3.small \
    --user-data file://worker.yaml \
    --subnet-id $SUBNET \
    --security-group-ids $SECURITY_GROUP \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=talos-aws-tutorial-worker}]"

Configure the Load Balancer

aws elbv2 create-target-group \
    --region $REGION \
    --name talos-aws-tutorial-tg \
    --protocol TCP \
    --port 6443 \
    --target-type ip \
    --vpc-id $VPC
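
The target group’s ARN can be captured into the $TARGET_GROUP_ARN variable used below, for example:

TARGET_GROUP_ARN=$(aws elbv2 describe-target-groups \
    --region $REGION \
    --names talos-aws-tutorial-tg \
    --query 'TargetGroups[0].TargetGroupArn' \
    --output text)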

Now, using the target group’s ARN and the PrivateIpAddress of each control plane instance, register the targets:

aws elbv2 register-targets \
    --region $REGION \
    --target-group-arn $TARGET_GROUP_ARN \
    --targets Id=$CP_NODE_1_IP  Id=$CP_NODE_2_IP  Id=$CP_NODE_3_IP

Using the ARNs of the load balancer and target group from previous steps, create the listener:

aws elbv2 create-listener \
    --region $REGION \
    --load-balancer-arn $LOAD_BALANCER_ARN \
    --protocol TCP \
    --port 443 \
    --default-actions Type=forward,TargetGroupArn=$TARGET_GROUP_ARN

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

4.2 - Azure

Creating a cluster via the CLI on Azure.

Creating a Cluster via the CLI

In this guide we will create an HA Kubernetes cluster with 1 worker node. We assume an existing Blob Storage account and some familiarity with Azure. If you need more information on Azure specifics, please see the official Azure documentation.

Environment Setup

We’ll make use of the following environment variables throughout the setup. Edit the variables below with your correct information.

# Storage account to use
export STORAGE_ACCOUNT="StorageAccountName"

# Storage container to upload to
export STORAGE_CONTAINER="StorageContainerName"

# Resource group name
export GROUP="ResourceGroupName"

# Location
export LOCATION="centralus"

# Get storage account connection string based on info above
export CONNECTION=$(az storage account show-connection-string \
                    -n $STORAGE_ACCOUNT \
                    -g $GROUP \
                    -o tsv)

Create the Image

First, download the Azure image from a Talos release. Once downloaded, untar with tar -xvf /path/to/azure-amd64.tar.gz

Upload the VHD

Once you have pulled down the image, you can upload it to blob storage with:

az storage blob upload \
  --connection-string $CONNECTION \
  --container-name $STORAGE_CONTAINER \
  -f /path/to/extracted/talos-azure.vhd \
  -n talos-azure.vhd

Register the Image

Now that the image is present in our blob storage, we’ll register it.

az image create \
  --name talos \
  --source https://$STORAGE_ACCOUNT.blob.core.windows.net/$STORAGE_CONTAINER/talos-azure.vhd \
  --os-type linux \
  -g $GROUP

Network Infrastructure

Virtual Networks and Security Groups

Once the image is prepared, we’ll want to work through setting up the network. Issue the following to create a network security group and add rules to it.

# Create vnet
az network vnet create \
  --resource-group $GROUP \
  --location $LOCATION \
  --name talos-vnet \
  --subnet-name talos-subnet

# Create network security group
az network nsg create -g $GROUP -n talos-sg

# Client -> apid
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n apid \
  --priority 1001 \
  --destination-port-ranges 50000 \
  --direction inbound

# Trustd
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n trustd \
  --priority 1002 \
  --destination-port-ranges 50001 \
  --direction inbound

# etcd
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n etcd \
  --priority 1003 \
  --destination-port-ranges 2379-2380 \
  --direction inbound

# Kubernetes API Server
az network nsg rule create \
  -g $GROUP \
  --nsg-name talos-sg \
  -n kube \
  --priority 1004 \
  --destination-port-ranges 6443 \
  --direction inbound

Load Balancer

We will create a public ip, load balancer, and a health check that we will use for our control plane.

# Create public ip
az network public-ip create \
  --resource-group $GROUP \
  --name talos-public-ip \
  --allocation-method static

# Create lb
az network lb create \
  --resource-group $GROUP \
  --name talos-lb \
  --public-ip-address talos-public-ip \
  --frontend-ip-name talos-fe \
  --backend-pool-name talos-be-pool

# Create health check
az network lb probe create \
  --resource-group $GROUP \
  --lb-name talos-lb \
  --name talos-lb-health \
  --protocol tcp \
  --port 6443

# Create lb rule for 6443
az network lb rule create \
  --resource-group $GROUP \
  --lb-name talos-lb \
  --name talos-6443 \
  --protocol tcp \
  --frontend-ip-name talos-fe \
  --frontend-port 6443 \
  --backend-pool-name talos-be-pool \
  --backend-port 6443 \
  --probe-name talos-lb-health

Network Interfaces

In Azure, we have to pre-create the NICs for our control plane so that they can be associated with our load balancer.

for i in $( seq 0 1 2 ); do
  # Create public IP for each nic
  az network public-ip create \
    --resource-group $GROUP \
    --name talos-controlplane-public-ip-$i \
    --allocation-method static


  # Create nic
  az network nic create \
    --resource-group $GROUP \
    --name talos-controlplane-nic-$i \
    --vnet-name talos-vnet \
    --subnet talos-subnet \
    --network-security-group talos-sg \
    --public-ip-address talos-controlplane-public-ip-$i \
    --lb-name talos-lb \
    --lb-address-pools talos-be-pool
done

# NOTES:
# Talos can detect PublicIPs automatically if PublicIP SKU is Basic.
# Use `--sku Basic` to set SKU to Basic.

Cluster Configuration

With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.

LB_PUBLIC_IP=$(az network public-ip show \
              --resource-group $GROUP \
              --name talos-public-ip \
              --query [ipAddress] \
              --output tsv)

talosctl gen config talos-k8s-azure-tutorial https://${LB_PUBLIC_IP}:6443

Compute Creation

We are now ready to create our azure nodes. Azure allows you to pass Talos machine configuration to the virtual machine at bootstrap time via user-data or custom-data methods.

Talos supports only the custom-data method; the machine configuration is available to the VM only on the first boot.

# Create availability set
az vm availability-set create \
  --name talos-controlplane-av-set \
  -g $GROUP

# Create the controlplane nodes
for i in $( seq 0 1 2 ); do
  az vm create \
    --name talos-controlplane-$i \
    --image talos \
    --custom-data ./controlplane.yaml \
    -g $GROUP \
    --admin-username talos \
    --generate-ssh-keys \
    --verbose \
    --boot-diagnostics-storage $STORAGE_ACCOUNT \
    --os-disk-size-gb 20 \
    --nics talos-controlplane-nic-$i \
    --availability-set talos-controlplane-av-set \
    --no-wait
done

# Create worker node
  az vm create \
    --name talos-worker-0 \
    --image talos \
    --vnet-name talos-vnet \
    --subnet talos-subnet \
    --custom-data ./worker.yaml \
    -g $GROUP \
    --admin-username talos \
    --generate-ssh-keys \
    --verbose \
    --boot-diagnostics-storage $STORAGE_ACCOUNT \
    --nsg talos-sg \
    --os-disk-size-gb 20 \
    --no-wait

# NOTES:
# `--admin-username` and `--generate-ssh-keys` are required by the az cli,
# but are not actually used by talos
# `--os-disk-size-gb` is the backing disk for Kubernetes and any workload containers
# `--boot-diagnostics-storage` is to enable console output which may be necessary
# for troubleshooting

Bootstrap Etcd

You should now be able to interact with your cluster with talosctl. We will need to discover the public IP for our first control plane node first.

CONTROL_PLANE_0_IP=$(az network public-ip show \
                    --resource-group $GROUP \
                    --name talos-controlplane-public-ip-0 \
                    --query [ipAddress] \
                    --output tsv)

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint $CONTROL_PLANE_0_IP
talosctl --talosconfig talosconfig config node $CONTROL_PLANE_0_IP

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

4.3 - DigitalOcean

Creating a cluster via the CLI on DigitalOcean.

Creating a Cluster via the CLI

In this guide we will create an HA Kubernetes cluster with 1 worker node. We assume an existing Space, and some familiarity with DigitalOcean. If you need more information on DigitalOcean specifics, please see the official DigitalOcean documentation.

Create the Image

First, download the DigitalOcean image from a Talos release. Extract the archive to get the disk.raw file, compress it using gzip to disk.raw.gz.
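
For example, assuming the release asset is named digital-ocean-amd64.tar.gz:

tar -xvf /path/to/digital-ocean-amd64.tar.gz
gzip disk.raw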

Using an upload method of your choice (doctl does not have Spaces support), upload the image to a space. Now, create an image using the URL of the uploaded image:

doctl compute image create \
    --region $REGION \
    --image-description talos-digital-ocean-tutorial \
    --image-url https://talos-tutorial.$REGION.digitaloceanspaces.com/disk.raw.gz \
    Talos

Save the image ID. We will need it when creating droplets.
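
If you did not save it from the output, custom images (and their IDs) can be listed with, for example:

doctl compute image list-user --format ID,Name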

Create a Load Balancer

doctl compute load-balancer create \
    --region $REGION \
    --name talos-digital-ocean-tutorial-lb \
    --tag-name talos-digital-ocean-tutorial-control-plane \
    --health-check protocol:tcp,port:6443,check_interval_seconds:10,response_timeout_seconds:5,healthy_threshold:5,unhealthy_threshold:3 \
    --forwarding-rules entry_protocol:tcp,entry_port:443,target_protocol:tcp,target_port:6443

We will need the IP of the load balancer. Using the ID of the load balancer, run:

doctl compute load-balancer get --format IP <load balancer ID>

Save it, as we will need it in the next step.

Create the Machine Configuration Files

Generating Base Configurations

Using the DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines:

$ talosctl gen config talos-k8s-digital-ocean-tutorial https://<load balancer IP or DNS>:<port>
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Validate the Configuration Files

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the Droplets

Create the Control Plane Nodes

Run the following commands to give ourselves three total control plane nodes:

doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --tag-names talos-digital-ocean-tutorial-control-plane \
    --user-data-file controlplane.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-control-plane-1
doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --tag-names talos-digital-ocean-tutorial-control-plane \
    --user-data-file controlplane.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-control-plane-2
doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --tag-names talos-digital-ocean-tutorial-control-plane \
    --user-data-file controlplane.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-control-plane-3

Note: Although SSH is not used by Talos, DigitalOcean still requires that an SSH key be associated with the droplet. Create a dummy key that can be used to satisfy this requirement.

Create the Worker Nodes

Run the following to create a worker node:

doctl compute droplet create \
    --region $REGION \
    --image <image ID> \
    --size s-2vcpu-4gb \
    --enable-private-networking \
    --user-data-file worker.yaml \
    --ssh-keys <ssh key fingerprint> \
    talos-worker-1

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP:

doctl compute droplet get --format PublicIPv4 <droplet ID>

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

4.4 - GCP

Creating a cluster via the CLI on Google Cloud Platform.

Creating a Cluster via the CLI

In this guide, we will create an HA Kubernetes cluster in GCP with 1 worker node. We will assume an existing Cloud Storage bucket, and some familiarity with Google Cloud. If you need more information on Google Cloud specifics, please see the official Google documentation.

jq and talosctl also need to be installed.

Manual Setup

Environment Setup

We’ll make use of the following environment variables throughout the setup. Edit the variables below with your correct information.

# Storage account to use
export STORAGE_BUCKET="StorageBucketName"
# Region
export REGION="us-central1"

Create the Image

First, download the Google Cloud image from a Talos release. These images are called gcp-$ARCH.tar.gz.

Upload the Image

Once you have downloaded the image, you can upload it to your storage bucket with:

gsutil cp /path/to/gcp-amd64.tar.gz gs://$STORAGE_BUCKET

Register the image

Now that the image is present in our bucket, we’ll register it.

gcloud compute images create talos \
 --source-uri=gs://$STORAGE_BUCKET/gcp-amd64.tar.gz \
 --guest-os-features=VIRTIO_SCSI_MULTIQUEUE

Network Infrastructure

Load Balancers and Firewalls

Once the image is prepared, we’ll want to work through setting up the network. Issue the following to create a firewall, load balancer, and their required components.

130.211.0.0/22 and 35.191.0.0/16 are the GCP Load Balancer IP ranges

# Create Instance Group
gcloud compute instance-groups unmanaged create talos-ig \
  --zone $REGION-b

# Create port for IG
gcloud compute instance-groups set-named-ports talos-ig \
    --named-ports tcp6443:6443 \
    --zone $REGION-b

# Create health check
gcloud compute health-checks create tcp talos-health-check --port 6443

# Create backend
gcloud compute backend-services create talos-be \
    --global \
    --protocol TCP \
    --health-checks talos-health-check \
    --timeout 5m \
    --port-name tcp6443

# Add instance group to backend
gcloud compute backend-services add-backend talos-be \
    --global \
    --instance-group talos-ig \
    --instance-group-zone $REGION-b

# Create tcp proxy
gcloud compute target-tcp-proxies create talos-tcp-proxy \
    --backend-service talos-be \
    --proxy-header NONE

# Create LB IP
gcloud compute addresses create talos-lb-ip --global

# Forward 443 from LB IP to tcp proxy
gcloud compute forwarding-rules create talos-fwd-rule \
    --global \
    --ports 443 \
    --address talos-lb-ip \
    --target-tcp-proxy talos-tcp-proxy

# Create firewall rule for health checks
gcloud compute firewall-rules create talos-controlplane-firewall \
     --source-ranges 130.211.0.0/22,35.191.0.0/16 \
     --target-tags talos-controlplane \
     --allow tcp:6443

# Create firewall rule to allow talosctl access
gcloud compute firewall-rules create talos-controlplane-talosctl \
  --source-ranges 0.0.0.0/0 \
  --target-tags talos-controlplane \
  --allow tcp:50000

Cluster Configuration

With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.

LB_PUBLIC_IP=$(gcloud compute forwarding-rules describe talos-fwd-rule \
               --global \
               --format json \
               | jq -r .IPAddress)

talosctl gen config talos-k8s-gcp-tutorial https://${LB_PUBLIC_IP}:443

Additionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Compute Creation

We are now ready to create our GCP nodes.

# Create the control plane nodes.
for i in $( seq 1 3 ); do
  gcloud compute instances create talos-controlplane-$i \
    --image talos \
    --zone $REGION-b \
    --tags talos-controlplane \
    --boot-disk-size 20GB \
    --metadata-from-file=user-data=./controlplane.yaml
done

# Add control plane nodes to instance group
for i in $( seq 1 3 ); do
  gcloud compute instance-groups unmanaged add-instances talos-ig \
      --zone $REGION-b \
      --instances talos-controlplane-$i
done

# Create worker
gcloud compute instances create talos-worker-0 \
  --image talos \
  --zone $REGION-b \
  --boot-disk-size 20GB \
  --metadata-from-file=user-data=./worker.yaml

Bootstrap Etcd

You should now be able to interact with your cluster with talosctl. We will need to discover the public IP for our first control plane node first.

CONTROL_PLANE_1_IP=$(gcloud compute instances describe talos-controlplane-1 \
                     --zone $REGION-b \
                     --format json \
                     | jq -r '.networkInterfaces[0].accessConfigs[0].natIP')

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint $CONTROL_PLANE_1_IP
talosctl --talosconfig talosconfig config node $CONTROL_PLANE_1_IP

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

Cleanup

# cleanup VM's
gcloud compute instances delete \
  talos-worker-0 \
  talos-controlplane-1 \
  talos-controlplane-2 \
  talos-controlplane-3

# cleanup firewall rules
gcloud compute firewall-rules delete \
  talos-controlplane-talosctl \
  talos-controlplane-firewall

# cleanup forwarding rules
gcloud compute forwarding-rules delete \
  talos-fwd-rule

# cleanup addresses
gcloud compute addresses delete \
  talos-lb-ip

# cleanup proxies
gcloud compute target-tcp-proxies delete \
  talos-tcp-proxy

# cleanup backend services
gcloud compute backend-services delete \
  talos-be

# cleanup health checks
gcloud compute health-checks delete \
  talos-health-check

# cleanup unmanaged instance groups
gcloud compute instance-groups unmanaged delete \
  talos-ig

# cleanup Talos image
gcloud compute images delete \
  talos

Using GCP Deployment manager

Using GCP deployment manager automatically creates a Google Storage bucket and uploads the Talos image to it. Once the deployment is complete the generated talosconfig and kubeconfig files are uploaded to the bucket.

By default this setup creates a three node control plane and a single worker in us-west1-b.

First we need to create a folder to store our deployment manifests and perform all subsequent operations from that folder.

mkdir -p talos-gcp-deployment
cd talos-gcp-deployment

Getting the deployment manifests

We need to download the deployment manifests from the Talos GitHub repository; the ccm manifest is only needed if the external cloud provider is enabled.

curl -fsSLO "https://raw.githubusercontent.com/talos-systems/talos/master/website/content/docs/v0.14/Cloud%20Platforms/gcp/config.yaml"
curl -fsSLO "https://raw.githubusercontent.com/talos-systems/talos/master/website/content/docs/v0.14/Cloud%20Platforms/gcp/talos-ha.jinja"
# if using ccm
curl -fsSLO "https://raw.githubusercontent.com/talos-systems/talos/master/website/content/docs/v0.14/Cloud%20Platforms/gcp/gcp-ccm.yaml"

Updating the config

Now we need to update the local config.yaml file with any required changes, such as the default zone, Talos version, machine sizes, node count, etc.

An example config.yaml file is shown below:

imports:
  - path: talos-ha.jinja

resources:
  - name: talos-ha
    type: talos-ha.jinja
    properties:
      zone: us-west1-b
      talosVersion: 
      externalCloudProvider: false
      controlPlaneNodeCount: 5
      controlPlaneNodeType: n1-standard-1
      workerNodeCount: 3
      workerNodeType: n1-standard-1
outputs:
  - name: bucketName
    value: $(ref.talos-ha.bucketName)

Enabling external cloud provider

Note: The externalCloudProvider property is set to false by default. The manifest used for deploying the ccm (cloud controller manager) is currently using the GCP ccm provided by openshift since there are no public images for the ccm yet.

Since the routes controller is disabled while deploying the CCM, the CNI pods need to be restarted after the CCM deployment is complete to remove the node.kubernetes.io/network-unavailable taint. See Nodes network-unavailable taint not removed after installing ccm for more information.

Use a custom built image for the ccm deployment if required.

Creating the deployment

Now we are ready to create the deployment. Confirm with y for any prompts. Run the following command to create the deployment:

# use a unique name for the deployment, resources are prefixed with the deployment name
export DEPLOYMENT_NAME="<deployment name>"
gcloud deployment-manager deployments create "${DEPLOYMENT_NAME}" --config config.yaml

Retrieving the outputs

First we need to get the deployment outputs.

# first get the outputs
OUTPUTS=$(gcloud deployment-manager deployments describe "${DEPLOYMENT_NAME}" --format json | jq '.outputs[]')

BUCKET_NAME=$(jq -r '. | select(.name == "bucketName").finalValue' <<< "${OUTPUTS}")
# used when cloud controller is enabled
SERVICE_ACCOUNT=$(jq -r '. | select(.name == "serviceAccount").finalValue' <<< "${OUTPUTS}")
PROJECT=$(jq -r '. | select(.name == "project").finalValue' <<< "${OUTPUTS}")

Note: If the cloud controller manager is enabled, the commands below need to be run to grant the controller’s service account the roles it needs to access cloud resources:

gcloud projects add-iam-policy-binding \
    "${PROJECT}" \
    --member "serviceAccount:${SERVICE_ACCOUNT}" \
    --role roles/iam.serviceAccountUser

gcloud projects add-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.admin

gcloud projects add-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.loadBalancerAdmin

Downloading talos and kube config

In addition to the talosconfig and kubeconfig files, the storage bucket contains the controlplane.yaml and worker.yaml files used to join additional nodes to the cluster.

gsutil cp "gs://${BUCKET_NAME}/generated/talosconfig" .
gsutil cp "gs://${BUCKET_NAME}/generated/kubeconfig" .

Deploying the cloud controller manager

kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  apply \
  --filename gcp-ccm.yaml
#  wait for the ccm to be up
kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  rollout status \
  daemonset cloud-controller-manager

If the cloud controller manager is enabled, we need to restart the CNI pods to remove the node.kubernetes.io/network-unavailable taint.

# restart the CNI pods, in this case flannel
kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  rollout restart \
  daemonset kube-flannel
# wait for the pods to be restarted
kubectl \
  --kubeconfig kubeconfig \
  --namespace kube-system \
  rollout status \
  daemonset kube-flannel

Check cluster status

kubectl \
  --kubeconfig kubeconfig \
  get nodes

Cleanup deployment

Warning: This will delete the deployment and all resources associated with it.

Run the following if the cloud controller manager is enabled:

gcloud projects remove-iam-policy-binding \
    "${PROJECT}" \
    --member "serviceAccount:${SERVICE_ACCOUNT}" \
    --role roles/iam.serviceAccountUser

gcloud projects remove-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.admin

gcloud projects remove-iam-policy-binding \
    "${PROJECT}" \
    --member serviceAccount:"${SERVICE_ACCOUNT}" \
    --role roles/compute.loadBalancerAdmin

Now we can finally remove the deployment:

# delete the objects in the bucket first
gsutil -m rm -r "gs://${BUCKET_NAME}"
gcloud deployment-manager deployments delete "${DEPLOYMENT_NAME}" --quiet

4.5 - Hetzner

Creating a cluster via the CLI (hcloud) on Hetzner.

Upload image

Hetzner Cloud does not support uploading custom images. You can email their support to get a Talos ISO uploaded (see issue 3599), or you can prepare an image snapshot yourself.

There are two options to upload your own.

  1. Run an instance in rescue mode and replace the system OS with the Talos image
  2. Use Hashicorp packer to prepare an image

Rescue mode

Create a new server in the Hetzner console. Enable the Hetzner Rescue System for this server and reboot. Upon reboot, the server will boot a special minimal Linux distribution designed for repair and reinstall. Once it is running, log in to the server using SSH and prepare the system disk by doing the following:

# Check that you are in Rescue mode
df

### Result is like:
# udev                   987432         0    987432   0% /dev
# 213.133.99.101:/nfs 308577696 247015616  45817536  85% /root/.oldroot/nfs
# overlay                995672      8340    987332   1% /
# tmpfs                  995672         0    995672   0% /dev/shm
# tmpfs                  398272       572    397700   1% /run
# tmpfs                    5120         0      5120   0% /run/lock
# tmpfs                  199132         0    199132   0% /run/user/0

# Download the Talos image
cd /tmp
wget -O /tmp/talos.raw.xz https://github.com/siderolabs/talos/releases/download/v0.13.0/hcloud-amd64.raw.xz
# Replace system
xz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync
# shutdown the instance
shutdown -h now

To make sure disk content is consistent, it is recommended to shut the server down before taking an image (snapshot). Once shutdown, simply create an image (snapshot) from the console. You can now use this snapshot to run Talos on the cloud.
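
If you prefer the CLI to the console, a snapshot can also be created with hcloud (a sketch; the server name talos-rescue is an assumption, and hcloud must be configured for your project):

hcloud server create-image --type snapshot --description "talos system disk" talos-rescue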

Packer

Install packer on your local machine.

Create a config file for packer to use:

# hcloud.pkr.hcl

packer {
  required_plugins {
    hcloud = {
      version = ">= 1.0.0"
      source  = "github.com/hashicorp/hcloud"
    }
  }
}

variable "talos_version" {
  type    = string
  default = "v0.13.0"
}

locals {
  image = "https://github.com/siderolabs/talos/releases/download/${var.talos_version}/hcloud-amd64.raw.xz"
}

source "hcloud" "talos" {
  rescue       = "linux64"
  image        = "debian-11"
  location     = "hel1"
  server_type  = "cx11"
  ssh_username = "root"

  snapshot_name = "talos system disk"
  snapshot_labels = {
    type    = "infra",
    os      = "talos",
    version = "${var.talos_version}",
  }
}

build {
  sources = ["source.hcloud.talos"]

  provisioner "shell" {
    inline = [
      "apt-get install -y wget",
      "wget -O /tmp/talos.raw.xz ${local.image}",
      "xz -d -c /tmp/talos.raw.xz | dd of=/dev/sda && sync",
    ]
  }
}

Create a new image by issuing the commands shown below. Note that to create a new API token for your Project, switch into the Hetzner Cloud Console, choose a Project, go to Access → Security, and create a new token.

# First, set the API token
export HCLOUD_TOKEN=${TOKEN}

# Upload image
packer init .
packer build .
# Save the image ID
export IMAGE_ID=<image-id-in-packer-output>

After doing this, you can find the snapshot in the console interface.

Creating a Cluster via the CLI

This section assumes you have the hcloud command-line utility installed on your local machine.

# Set hcloud context and api key
hcloud context create talos-tutorial

Create a Load Balancer

Create a load balancer by issuing the commands shown below. Save the IP/DNS name, as this info will be used in the next step.

hcloud load-balancer create --name controlplane --network-zone eu-central --type lb11 --label 'type=controlplane'

### Result is like:
# LoadBalancer 484487 created
# IPv4: 49.12.X.X
# IPv6: 2a01:4f8:X:X::1

hcloud load-balancer add-service controlplane \
    --listen-port 6443 --destination-port 6443 --protocol tcp
hcloud load-balancer add-target controlplane \
    --label-selector 'type=controlplane'

Create the Machine Configuration Files

Generating Base Configurations

Using the IP/DNS name of the loadbalancer created earlier, generate the base configuration files for the Talos machines by issuing:

$ talosctl gen config talos-k8s-hcloud-tutorial https://<load balancer IP or DNS>:6443
created controlplane.yaml
created worker.yaml
created talosconfig

At this point, you can modify the generated configs to your liking. Optionally, you can specify --config-patch with RFC6902 jsonpatches which will be applied during the config generation.

Validate the Configuration Files

Validate any edited machine configs with:

$ talosctl validate --config controlplane.yaml --mode cloud
controlplane.yaml is valid for cloud mode
$ talosctl validate --config worker.yaml --mode cloud
worker.yaml is valid for cloud mode

Create the Servers

We can now create our servers. Note that you can find IMAGE_ID in the snapshot section of the console: https://console.hetzner.cloud/projects/$PROJECT_ID/servers/snapshots.
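
The snapshot ID can also be listed with the hcloud CLI, for example:

hcloud image list --type snapshot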

Create the Control Plane Nodes

Create the control plane nodes with:

export IMAGE_ID=<your-image-id>

hcloud server create --name talos-control-plane-1 \
    --image ${IMAGE_ID} \
    --type cx21 --location hel1 \
    --label 'type=controlplane' \
    --user-data-from-file controlplane.yaml

hcloud server create --name talos-control-plane-2 \
    --image ${IMAGE_ID} \
    --type cx21 --location fsn1 \
    --label 'type=controlplane' \
    --user-data-from-file controlplane.yaml

hcloud server create --name talos-control-plane-3 \
    --image ${IMAGE_ID} \
    --type cx21 --location nbg1 \
    --label 'type=controlplane' \
    --user-data-from-file controlplane.yaml

Create the Worker Nodes

Create the worker nodes with the following command, repeating (and incrementing the name counter) as many times as desired.

hcloud server create --name talos-worker-1 \
    --image ${IMAGE_ID} \
    --type cx21 --location hel1 \
    --label 'type=worker' \
    --user-data-from-file worker.yaml

Bootstrap Etcd

To configure talosctl we will need the first control plane node’s IP. This can be found by issuing:

hcloud server list | grep talos-control-plane

Set the endpoints and nodes for your talosconfig with:

talosctl --talosconfig talosconfig config endpoint <control-plane-1-IP>
talosctl --talosconfig talosconfig config node <control-plane-1-IP>

Bootstrap etcd on the first control plane node with:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

4.6 - Nocloud

Creating a cluster via the CLI using qemu.

Talos supports nocloud data source implementation.

There are two ways to configure Talos server with nocloud platform:

  • via SMBIOS “serial number” option
  • using CDROM or USB-flash filesystem

SMBIOS Serial Number

This method requires the network connection to be up (e.g. via DHCP), as the configuration is fetched from an HTTP server.

ds=nocloud-net;s=http://10.10.0.1/configs/;h=HOSTNAME

After the network initialization is complete, Talos fetches:

  • the machine config from http://10.10.0.1/configs/user-data
  • the network config (if available) from http://10.10.0.1/configs/network-config

SMBIOS: QEMU

Add the following flag to qemu command line when starting a VM:

qemu-system-x86_64 \
  ...\
  -smbios "type=1,serial=ds=nocloud-net;s=http://10.10.0.1/configs/"

SMBIOS: Proxmox

Set the source machine config through the serial number on Proxmox GUI.

Proxmox stores the VM config at /etc/pve/qemu-server/$ID.conf (where $ID is the VM ID number of the virtual machine); you will see something like:

...
smbios1: uuid=ceae4d10,serial=ZHM9bm9jbG91ZC1uZXQ7cz1odHRwOi8vMTAuMTAuMC4xL2NvbmZpZ3Mv,base64=1
...

Where serial holds the base64-encoded string version of ds=nocloud-net;s=http://10.10.0.1/configs/.
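
The encoded value can be generated with, for example:

echo -n 'ds=nocloud-net;s=http://10.10.0.1/configs/' | base64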

CDROM/USB

Talos can also get machine config from local attached storage without any prior network connection being established.

You can provide configs to the server via files on a VFAT or ISO9660 filesystem. The filesystem volume label must be cidata or CIDATA.

Example: QEMU

Create and prepare Talos machine config:

export CONTROL_PLANE_IP=192.168.1.10

talosctl gen config talos-nocloud https://$CONTROL_PLANE_IP:6443 --output-dir _out

Prepare cloud-init configs:

mkdir -p iso
mv _out/controlplane.yaml iso/user-data
echo "local-hostname: controlplane-1" > iso/meta-data
cat > iso/network-config << EOF
version: 1
config:
   - type: physical
     name: eth0
     mac_address: "52:54:00:12:34:00"
     subnets:
        - type: static
          address: 192.168.1.10
          netmask: 255.255.255.0
          gateway: 192.168.1.254
EOF

Create cloud-init iso image

cd iso && genisoimage -output cidata.iso -V cidata -r -J user-data meta-data network-config

Start the VM

qemu-system-x86_64 \
    ...
    -cdrom iso/cidata.iso \
    ...

Example: Proxmox

Proxmox can create the cloud-init disk for you. Edit the cloud-init config information in Proxmox, substituting your own information as necessary, and then update the cicustom param at /etc/pve/qemu-server/$ID.conf:

cicustom: user=local:snippets/master-1.yml
ipconfig0: ip=192.168.1.10/24,gw=192.168.10.254
nameserver: 1.1.1.1
searchdomain: local

Note: snippets/master-1.yml is the Talos machine config. It is usually located at /var/lib/vz/snippets/master-1.yml. This file must be placed at this path manually, as Proxmox does not support snippet uploading via API/GUI.

Click on the Regenerate Image button after the above changes are made.

4.7 - OpenStack

Creating a cluster via the CLI on OpenStack.

Creating a Cluster via the CLI

In this guide, we will create an HA Kubernetes cluster in OpenStack with 1 worker node. We will assume some familiarity with OpenStack. If you need more information on OpenStack specifics, please see the official OpenStack documentation.

Environment Setup

You should have an existing openrc file. This file will provide environment variables necessary to talk to your OpenStack cloud. See here for instructions on fetching this file.
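
Once you have it, source it into your shell so the openstack commands below can authenticate (the filename below is just an example; yours may differ):

source ./openrc.sh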

Create the Image

First, download the OpenStack image from a Talos release. These images are called openstack-$ARCH.tar.gz. Untar this file with tar -xvf openstack-$ARCH.tar.gz. The resulting file will be called disk.raw.

Upload the Image

Once you have the image, you can upload to OpenStack with:

openstack image create --public --disk-format raw --file disk.raw talos

Network Infrastructure

Load Balancer and Network Ports

Once the image is prepared, you will need to work through setting up the network. Issue the following to create a load balancer, the necessary network ports for each control plane node, and associations between the two.

Creating loadbalancer:

# Create load balancer, updating vip-subnet-id if necessary
openstack loadbalancer create --name talos-control-plane --vip-subnet-id public

# Create listener
openstack loadbalancer listener create --name talos-control-plane-listener --protocol TCP --protocol-port 6443 talos-control-plane

# Pool and health monitoring
openstack loadbalancer pool create --name talos-control-plane-pool --lb-algorithm ROUND_ROBIN --listener talos-control-plane-listener --protocol TCP
openstack loadbalancer healthmonitor create --delay 5 --max-retries 4 --timeout 10 --type TCP talos-control-plane-pool

Creating ports:

# Create ports for control plane nodes, updating network name if necessary
openstack port create --network shared talos-control-plane-1
openstack port create --network shared talos-control-plane-2
openstack port create --network shared talos-control-plane-3

# Create floating IPs for the ports, so that you will have talosctl connectivity to each control plane
openstack floating ip create --port talos-control-plane-1 public
openstack floating ip create --port talos-control-plane-2 public
openstack floating ip create --port talos-control-plane-3 public

Note: Take note of the private and public IPs associated with each of these ports, as they will be used in the next step. Additionally, take note of the port ID, as it will be used in server creation.

Associate port’s private IPs to loadbalancer:

# Create members for each port IP, updating subnet-id and address as necessary.
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-1 PORT> --protocol-port 6443 talos-control-plane-pool
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-2 PORT> --protocol-port 6443 talos-control-plane-pool
openstack loadbalancer member create --subnet-id shared-subnet --address <PRIVATE IP OF talos-control-plane-3 PORT> --protocol-port 6443 talos-control-plane-pool

Security Groups

This example uses the default security group in OpenStack. Ports have been opened to ensure that connectivity from both inside and outside the group is possible. You will want to allow, at a minimum, ports 6443 (Kubernetes API server) and 50000 (Talos API) from external sources. It is also recommended to allow communication over all ports from within the subnet.
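
If your default security group does not already allow this, rules can be added with, for example:

# Kubernetes API server
openstack security group rule create --protocol tcp --dst-port 6443 default
# Talos API
openstack security group rule create --protocol tcp --dst-port 50000 default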

Cluster Configuration

With our networking bits setup, we’ll fetch the IP for our load balancer and create our configuration files.

LB_PUBLIC_IP=$(openstack loadbalancer show talos-control-plane -f json | jq -r .vip_address)

talosctl gen config talos-k8s-openstack-tutorial https://${LB_PUBLIC_IP}:6443

Additionally, you can specify --config-patch with RFC6902 jsonpatch which will be applied during the config generation.

Compute Creation

We are now ready to create our OpenStack nodes.

Create control plane:

# Create control plane nodes 1 through 3.
for i in $( seq 1 3 ); do
  openstack server create talos-control-plane-$i --flavor m1.small --nic port-id=talos-control-plane-$i --image talos --user-data /path/to/controlplane.yaml
done

Create worker:

# Update network name as necessary.
openstack server create talos-worker-1 --flavor m1.small --network shared --image talos --user-data /path/to/worker.yaml

Note: This step can be repeated to add more workers.

Bootstrap Etcd

You should now be able to interact with your cluster with talosctl. We will use one of the floating IPs we allocated earlier. It does not matter which one.

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

4.8 - Scaleway

Creating a cluster via the CLI (scw) on scaleway.com.

Talos is known to work on scaleway.com; however, it is currently undocumented.

4.9 - UpCloud

Creating a cluster via the CLI (upctl) on UpCloud.com.

Talos is known to work on UpCloud.com; however, it is currently undocumented.

4.10 - Vultr

Creating a cluster via the CLI (vultr-cli) on Vultr.com.

Talos is known to work on Vultr.com; however, it is currently undocumented.

5 - Local Platforms

5.1 - Docker

Creating Talos Kubernetes cluster using Docker.

In this guide we will create a Kubernetes cluster in Docker, using a containerized version of Talos.

Running Talos in Docker is intended to be used in CI pipelines, and local testing when you need a quick and easy cluster. Furthermore, if you are running Talos in production, it provides an excellent way for developers to develop against the same version of Talos.

Requirements

The following are requirements for running Talos in Docker:

  • Docker 18.03 or greater
  • a recent version of talosctl

Caveats

Because Talos runs in a container, certain APIs are not available when running in Docker. For example, the upgrade and reset APIs don’t apply in container mode.

Create the Cluster

Creating a local cluster is as simple as:

talosctl cluster create --wait

Once the above finishes successfully, your talosconfig (~/.talos/config) will be configured to point to the new cluster.

If you are running on macOS, an additional command is required:

talosctl config endpoint 127.0.0.1

Note: Startup times can take up to a minute before the cluster is available.

Retrieve and Configure the kubeconfig

talosctl kubeconfig .
kubectl --kubeconfig kubeconfig config set-cluster talos-default --server https://127.0.0.1:6443

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace. To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.

Cleaning Up

To cleanup, run:

talosctl cluster destroy

5.2 - QEMU

Creating Talos Kubernetes cluster using QEMU VMs.

In this guide we will create a Kubernetes cluster using QEMU.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Requirements

  • Linux
  • a kernel with
    • KVM enabled (/dev/kvm must exist)
    • CONFIG_NET_SCH_NETEM enabled
    • CONFIG_NET_SCH_INGRESS enabled
  • at least CAP_SYS_ADMIN and CAP_NET_ADMIN capabilities
  • QEMU
  • bridge, static and firewall CNI plugins from the standard CNI plugins, and tc-redirect-tap CNI plugin from the awslabs tc-redirect-tap installed to /opt/cni/bin (installed automatically by talosctl)
  • iptables
  • /var/run/netns directory should exist

Installation

How to get QEMU

Install QEMU with your operating system package manager. For example, on Ubuntu for x86:

apt install qemu-system-x86 qemu-kvm

Install talosctl

You can download talosctl and all required binaries via github.com/talos-systems/talos/releases

curl https://github.com/siderolabs/talos/releases/download/<version>/talosctl-<platform>-<arch> -L -o talosctl

For example version v0.14.0 for linux platform:

curl https://github.com/talos-systems/talos/releases/latest/download/talosctl-linux-amd64 -L -o talosctl
sudo cp talosctl /usr/local/bin
sudo chmod +x /usr/local/bin/talosctl

Install Talos kernel and initramfs

QEMU provisioner depends on Talos kernel (vmlinuz) and initramfs (initramfs.xz). These files can be downloaded from the Talos release:

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/<version>/vmlinuz-<arch> -L -o _out/vmlinuz-<arch>
curl https://github.com/siderolabs/talos/releases/download/<version>/initramfs-<arch>.xz -L -o _out/initramfs-<arch>.xz

For example version v0.14.0:

curl https://github.com/siderolabs/talos/releases/download/v0.14.0/vmlinuz-amd64 -L -o _out/vmlinuz-amd64
curl https://github.com/siderolabs/talos/releases/download/v0.14.0/initramfs-amd64.xz -L -o _out/initramfs-amd64.xz

Create the Cluster

For the first time, create root state directory as your user so that you can inspect the logs as non-root user:

mkdir -p ~/.talos/clusters

Create the cluster:

sudo -E talosctl cluster create --provisioner qemu

Before the first cluster is created, talosctl will download the CNI bundle for the VM provisioning and install it to ~/.talos/cni directory.

Once the above finishes successfully, your talosconfig (~/.talos/config) will be configured to point to the new cluster, and kubeconfig will be downloaded and merged into default kubectl config location (~/.kube/config).

Cluster provisioning process can be optimized with registry pull-through caches.

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl -n 10.5.0.2 containers for a list of containers in the system namespace, or talosctl -n 10.5.0.2 containers -k for the k8s.io namespace. To view the logs of a container, use talosctl -n 10.5.0.2 logs <container> or talosctl -n 10.5.0.2 logs -k <container>.

A bridge interface will be created, and assigned the default IP 10.5.0.1. Each node will be directly accessible on the subnet specified at cluster creation time. A loadbalancer runs on 10.5.0.1 by default, which handles loadbalancing for the Kubernetes APIs.
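
The subnet can be changed at cluster creation time with the --cidr flag, for example:

sudo -E talosctl cluster create --provisioner qemu --cidr 10.6.0.0/24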

You can see a summary of the cluster state by running:

$ talosctl cluster show --provisioner qemu
PROVISIONER       qemu
NAME              talos-default
NETWORK NAME      talos-default
NETWORK CIDR      10.5.0.0/24
NETWORK GATEWAY   10.5.0.1
NETWORK MTU       1500

NODES:

NAME                     TYPE           IP         CPU    RAM      DISK
talos-default-master-1   Init           10.5.0.2   1.00   1.6 GB   4.3 GB
talos-default-master-2   ControlPlane   10.5.0.3   1.00   1.6 GB   4.3 GB
talos-default-master-3   ControlPlane   10.5.0.4   1.00   1.6 GB   4.3 GB
talos-default-worker-1   Worker         10.5.0.5   1.00   1.6 GB   4.3 GB

Cleaning Up

To cleanup, run:

sudo -E talosctl cluster destroy --provisioner qemu

Note: In the case that the host machine is rebooted before destroying the cluster, you may need to manually remove ~/.talos/clusters/talos-default.

Manual Clean Up

The talosctl cluster destroy command depends heavily on the cluster’s state directory, which contains all related information about the cluster: the PIDs and network resources associated with the cluster nodes.

If you have deleted the state folder by mistake, or would like to clean up the environment manually, follow the steps below:

Remove VM Launchers

Find the process of talosctl qemu-launch:

ps -elf | grep 'talosctl qemu-launch'

To remove the VMs manually, execute:

sudo kill -s SIGTERM <PID>

Example output, where VMs are running with PIDs 157615 and 157617

ps -elf | grep '[t]alosctl qemu-launch'
0 S root      157615    2835  0  80   0 - 184934 -     07:53 ?        00:00:00 talosctl qemu-launch
0 S root      157617    2835  0  80   0 - 185062 -     07:53 ?        00:00:00 talosctl qemu-launch
sudo kill -s SIGTERM 157615
sudo kill -s SIGTERM 157617

Stopping VMs

Find the process of qemu-system:

ps -elf | grep 'qemu-system'

To stop the VMs manually, execute:

sudo kill -s SIGTERM <PID>

Example output, showing the running qemu-system processes and the commands to stop them:

ps -elf | grep qemu-system
2 S root     1061663 1061168 26  80   0 - 1786238 -    14:05 ?        01:53:56 qemu-system-x86_64 -m 2048 -drive format=raw,if=virtio,file=/home/username/.talos/clusters/talos-default/bootstrap-master.disk -smp cpus=2 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=1e:86:c6:b4:7c:c4 -device virtio-rng-pci -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=7ec0a73c-826e-4eeb-afd1-39ff9f9160ca -machine q35,accel=kvm
2 S root     1061663 1061170 67  80   0 - 621014 -     21:23 ?        00:00:07 qemu-system-x86_64 -m 2048 -drive format=raw,if=virtio,file=/homeusername/.talos/clusters/talos-default/pxe-1.disk -smp cpus=2 -cpu max -nographic -netdev tap,id=net0,ifname=tap0,script=no,downscript=no -device virtio-net-pci,netdev=net0,mac=36:f3:2f:c3:9f:06 -device virtio-rng-pci -no-reboot -boot order=cn,reboot-timeout=5000 -smbios type=1,uuid=ce12a0d0-29c8-490f-b935-f6073ab916a6 -machine q35,accel=kvm
sudo kill -s SIGTERM 1061663
sudo kill -s SIGTERM 1061663

Remove load balancer

Find the process of talosctl loadbalancer-launch:

ps -elf | grep 'talosctl loadbalancer-launch'

To remove the LB manually, execute:

sudo kill -s SIGTERM <PID>

Example output, where loadbalancer is running with PID 157609

ps -elf | grep '[t]alosctl loadbalancer-launch'
4 S root      157609    2835  0  80   0 - 184998 -     07:53 ?        00:00:07 talosctl loadbalancer-launch --loadbalancer-addr 10.5.0.1 --loadbalancer-upstreams 10.5.0.2
sudo kill -s SIGTERM 157609

Remove DHCP server

Find the process of talosctl dhcpd-launch:

ps -elf | grep 'talosctl dhcpd-launch'

To remove the DHCP server manually, execute:

sudo kill -s SIGTERM <PID>

Example output, where the DHCP server is running with PID 157609:

ps -elf | grep '[t]alosctl dhcpd-launch'
4 S root      157609    2835  0  80   0 - 184998 -     07:53 ?        00:00:07 talosctl dhcpd-launch --state-path /home/username/.talos/clusters/talos-default --addr 10.5.0.1 --interface talosbd9c32bc
sudo kill -s SIGTERM 157609

Remove network

This is the trickier part, especially if you have already deleted the state folder. If you didn’t, the bridge name is recorded in state.yaml in the ~/.talos/clusters/<cluster-name> directory.

sudo cat ~/.talos/clusters/<cluster-name>/state.yaml | grep bridgename
bridgename: talos<uuid>

If you only had one cluster, then it will be the interface named talos<uuid>:

46: talos<uuid>: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether a6:72:f4:0a:d3:9c brd ff:ff:ff:ff:ff:ff
    inet 10.5.0.1/24 brd 10.5.0.255 scope global talos17c13299
       valid_lft forever preferred_lft forever
    inet6 fe80::a472:f4ff:fe0a:d39c/64 scope link
       valid_lft forever preferred_lft forever

To remove this interface:

sudo ip link del talos<uuid>

Remove state directory

To remove the state directory execute:

sudo rm -Rf /home/$USER/.talos/clusters/<cluster-name>

Troubleshooting

Logs

Inspect logs directory

sudo cat ~/.talos/clusters/<cluster-name>/*.log

Logs are saved under <cluster-name>-<role>-<node-id>.log

For example, in the case of a cluster named k8s:

ls -la ~/.talos/clusters/k8s | grep log
-rw-r--r--. 1 root root      69415 Apr 26 20:58 k8s-master-1.log
-rw-r--r--. 1 root root      68345 Apr 26 20:58 k8s-worker-1.log
-rw-r--r--. 1 root root      24621 Apr 26 20:59 lb.log

Inspect logs during the installation

tail -f ~/.talos/clusters/<cluster-name>/*.log

5.3 - VirtualBox

Creating a Talos Kubernetes cluster using VirtualBox VMs.

In this guide we will create a Kubernetes cluster using VirtualBox.

Video Walkthrough

To see a live demo of this writeup, visit YouTube here:

Installation

How to Get VirtualBox

Install VirtualBox with your operating system package manager or from the website. For example, on Ubuntu for x86:

apt install virtualbox

Install talosctl

You can download talosctl via github.com/talos-systems/talos/releases

curl https://github.com/siderolabs/talos/releases/download/<version>/talosctl-<platform>-<arch> -L -o talosctl

For example version v0.14.0 for linux platform:

curl https://github.com/talos-systems/talos/releases/latest/download/talosctl-linux-amd64 -L -o talosctl
sudo cp talosctl /usr/local/bin
sudo chmod +x /usr/local/bin/talosctl

Download ISO Image

In order to install Talos in VirtualBox, you will need the ISO image from the Talos release page. You can download talos-amd64.iso via github.com/talos-systems/talos/releases

mkdir -p _out/
curl https://github.com/siderolabs/talos/releases/download/<version>/talos-<arch>.iso -L -o _out/talos-<arch>.iso

For example version v0.14.0 for linux platform:

mkdir -p _out/
curl https://github.com/talos-systems/talos/releases/latest/download/talos-amd64.iso -L -o _out/talos-amd64.iso

Create VMs

Start by creating a new VM by clicking the “New” button in the VirtualBox UI:

Supply a name for this VM, and specify the Type and Version:

Edit the memory to supply at least 2GB of RAM for the VM:

Proceed through the disk settings, keeping the defaults. You can increase the disk space if desired.

Once created, select the VM and hit “Settings”:

In the “System” section, supply at least 2 CPUs:

In the “Network” section, switch the network “Attached To” section to “Bridged Adapter”:

Finally, in the “Storage” section, select the optical drive and, on the right, select the ISO by browsing your filesystem:

Repeat this process for a second VM to use as a worker node. You can also repeat this for additional nodes desired.
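
If you would rather script the VM creation than click through the UI, roughly equivalent steps can be done with VBoxManage. The sketch below is not part of the original walkthrough; it assumes a bridged host interface named eth0 and the ISO at _out/talos-amd64.iso:

VM=talos-control-plane-1

# Create and register the VM with 2 GB RAM and 2 CPUs on a bridged network
VBoxManage createvm --name $VM --ostype Linux_64 --register
VBoxManage modifyvm $VM --memory 2048 --cpus 2 --nic1 bridged --bridgeadapter1 eth0

# Create a disk and attach it together with the Talos ISO
VBoxManage createmedium disk --filename $VM.vdi --size 8192
VBoxManage storagectl $VM --name SATA --add sata --controller IntelAhci
VBoxManage storageattach $VM --storagectl SATA --port 0 --device 0 --type hdd --medium $VM.vdi
VBoxManage storageattach $VM --storagectl SATA --port 1 --device 0 --type dvddrive --medium _out/talos-amd64.iso

# Boot the VM
VBoxManage startvm $VM --type headless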

Start Control Plane Node

Once the VMs have been created and updated, start the VM that will be the first control plane node. This VM will boot the ISO image specified earlier and enter “maintenance mode”. Once the machine has entered maintenance mode, there will be a console log that details the IP address that the node received. Take note of this IP address, which will be referred to as $CONTROL_PLANE_IP for the rest of this guide. If you wish to export this IP as a bash variable, simply issue a command like export CONTROL_PLANE_IP=1.2.3.4.

Generate Machine Configurations

With the IP address above, you can now generate the machine configurations to use for installing Talos and Kubernetes. Issue the following command, updating the output directory, cluster name, and control plane IP as you see fit:

talosctl gen config talos-vbox-cluster https://$CONTROL_PLANE_IP:6443 --output-dir _out

This will create several files in the _out directory: controlplane.yaml, worker.yaml, and talosconfig.

Create Control Plane Node

Using the controlplane.yaml generated above, you can now apply this config using talosctl. Issue:

talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file _out/controlplane.yaml

You should now see some action in the VirtualBox console for this VM. Talos will be installed to disk, the VM will reboot, and then Talos will configure the Kubernetes control plane on this VM.

Note: This process can be repeated multiple times to create an HA control plane.

Create Worker Node

Create at least a single worker node using a process similar to the control plane creation above. Start the worker node VM and wait for it to enter “maintenance mode”. Take note of the worker node’s IP address, which will be referred to as $WORKER_IP.

Issue:

talosctl apply-config --insecure --nodes $WORKER_IP --file _out/worker.yaml

Note: This process can be repeated multiple times to add additional workers.

Using the Cluster

Once the cluster is available, you can make use of talosctl and kubectl to interact with the cluster. For example, to view current running containers, run talosctl containers for a list of containers in the system namespace, or talosctl containers -k for the k8s.io namespace. To view the logs of a container, use talosctl logs <container> or talosctl logs -k <container>.

First, configure talosctl to talk to your control plane node by issuing the following, updating paths and IPs as necessary:

export TALOSCONFIG="_out/talosconfig"
talosctl config endpoint $CONTROL_PLANE_IP
talosctl config node $CONTROL_PLANE_IP

Bootstrap Etcd

Set the endpoints and nodes:

talosctl --talosconfig talosconfig config endpoint <control plane 1 IP>
talosctl --talosconfig talosconfig config node <control plane 1 IP>

Bootstrap etcd:

talosctl --talosconfig talosconfig bootstrap

Retrieve the kubeconfig

At this point we can retrieve the admin kubeconfig by running:

talosctl --talosconfig talosconfig kubeconfig .

You can then use kubectl in this fashion:

kubectl get nodes

Cleaning Up

To cleanup, simply stop and delete the virtual machines from the VirtualBox UI.

6 - Single Board Computers

6.1 - Banana Pi M64

Installing Talos on Banana Pi M64 SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest alpha talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/metal-bananapi_m64-arm64.img.xz
xz -d metal-bananapi_m64-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.
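
For example:

# Linux
sudo fdisk -l
# macOS
diskutil list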

Now dd the image to your SD card:

sudo dd if=metal-bananapi_m64-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card into your board, turn it on, and wait for the console to show you the instructions for bootstrapping the node. Follow the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

6.2 - Libre Computer Board ALL-H3-CC

Installing Talos on Libre Computer Board ALL-H3-CC SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest alpha talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/metal-libretech_all_h3_cc_h5-arm64.img.xz
xz -d metal-libretech_all_h3_cc_h5-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-libretech_all_h3_cc_h5-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card into your board, turn it on, and wait for the console to show you the instructions for bootstrapping the node. Follow the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

6.3 - Pine64

Installing Talos on a Pine64 SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest alpha talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/metal-pine64-arm64.img.xz
xz -d metal-pine64-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-pine64-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card into your board, turn it on, and wait for the console to show you the instructions for bootstrapping the node. Follow the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

6.4 - Pine64 Rock64

Installing Talos on Pine64 Rock64 SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest alpha talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/metal-rock64-arm64.img.xz
xz -d metal-rock64-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-rock64-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card into your board, turn it on, and wait for the console to show you the instructions for bootstrapping the node. Follow the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

6.5 - Radxa ROCK PI 4c

Installing Talos on Radxa ROCK PI 4c SBC using raw disk image.

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest alpha talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Download the Image

Download the image and decompress it:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/metal-rockpi_4-arm64.img.xz
xz -d metal-rockpi_4-arm64.img.xz

Writing the Image

The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

Now dd the image to your SD card:

sudo dd if=metal-rockpi_4-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card into your board, turn it on, and wait for the console to show you the instructions for bootstrapping the node. Follow the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

Boot Talos from an eMMC or SSD Drive

Note: this is only tested on Rock PI 4c

It is possible to run Talos without an SD card, directly from either an eMMC or an SSD drive.

The pre-installed SPI loader is too outdated to chain-load the Talos u-boot on the device.

Instead, it is necessary to update u-boot to a more recent version for this process to work. The Armbian u-boot build for Rock PI 4c has been proved to work: https://users.armbian.com/piter75/.

Steps

  • Flash the Rock PI 4c variant of Debian to the SD card.
  • Check that /dev/mtdblock0 exists (e.g. with lsblk); otherwise the dd command below will silently fail.
  • Download Armbian u-boot and update SPI flash:
curl -LO https://users.armbian.com/piter75/rkspi_loader-v20.11.2-trunk-v2.img
sudo dd if=rkspi_loader-v20.11.2-trunk-v2.img of=/dev/mtdblock0 bs=4K
  • Optionally, you can also write Talos image to the SSD drive right from your Rock PI board:
curl -LO https://github.com/talos-systems/talos/releases/latest/download/metal-rockpi_4-arm64.img.xz
xz -d metal-rockpi_4-arm64.img.xz
sudo dd if=metal-rockpi_4-arm64.img of=/dev/nvme0n1
  • Remove the SD card and reboot.

After these steps, Talos will boot from the SSD and enter maintenance mode. The rest of the flow is the same as running Talos from the SD card.

6.6 - Raspberry Pi 4 Model B

Installing Talos on Rpi4 SBC using raw disk image.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Prerequisites

You will need

  • talosctl
  • an SD card

Download the latest alpha talosctl.

curl -Lo /usr/local/bin/talosctl https://github.com/talos-systems/talos/releases/latest/download/talosctl-$(uname -s | tr "[:upper:]" "[:lower:]")-amd64
chmod +x /usr/local/bin/talosctl

Updating the EEPROM

At least version v2020.09.03-138a1 of the bootloader (rpi-eeprom) is required. To update the bootloader we will need an SD card. Insert the SD card into your computer and use Raspberry Pi Imager to install the bootloader on it (select Operating System > Misc utility images > Bootloader > SD Card Boot). Alternatively, you can use the console on Linux or macOS. The path to your SD card can be found using fdisk on Linux or diskutil on macOS. In this example, we will assume /dev/mmcblk0.

curl -Lo rpi-boot-eeprom-recovery.zip https://github.com/raspberrypi/rpi-eeprom/releases/download/v2021.04.29-138a1/rpi-boot-eeprom-recovery-2021-04-29-vl805-000138a1.zip
sudo mkfs.fat -I /dev/mmcblk0
sudo mount /dev/mmcblk0 /mnt
sudo bsdtar -xf rpi-boot-eeprom-recovery.zip -C /mnt

Remove the SD card from your local machine and insert it into the Raspberry Pi. Power the Raspberry Pi on, and wait at least 10 seconds. If successful, the green LED light will blink rapidly (forever), otherwise an error pattern will be displayed. If an HDMI display is attached to the port closest to the power/USB-C port, the screen will display green for success or red if a failure occurs. Power off the Raspberry Pi and remove the SD card from it.

Note: Updating the bootloader only needs to be done once.

Download the Image

Download the image and decompress it:

curl -LO https://github.com/talos-systems/talos/releases/latest/download/metal-rpi_4-arm64.img.xz
xz -d metal-rpi_4-arm64.img.xz

Writing the Image

Now dd the image to your SD card:

sudo dd if=metal-rpi_4-arm64.img of=/dev/mmcblk0 conv=fsync bs=4M

Bootstrapping the Node

Insert the SD card into your board, turn it on, and wait for the console to show you the instructions for bootstrapping the node. Follow the instructions in the console output to connect to the interactive installer:

talosctl apply-config --insecure --interactive --nodes <node IP or DNS name>

Once the interactive installation is applied, the cluster will form and you can then use kubectl.

Note: if you have an HDMI display attached and it shows only a rainbow splash, please use the other HDMI port, the one closest to the power/USB-C port.

Retrieve the kubeconfig

Retrieve the admin kubeconfig by running:

talosctl kubeconfig

Troubleshooting

The following table can be used to troubleshoot booting issues:

Long Flashes   Short Flashes   Status
0              3               Generic failure to boot
0              4               start*.elf not found
0              7               Kernel image not found
0              8               SDRAM failure
0              9               Insufficient SDRAM
0              10              In HALT state
2              1               Partition not FAT
2              2               Failed to read from partition
2              3               Extended partition not FAT
2              4               File signature/hash mismatch - Pi 4
4              4               Unsupported board type
4              5               Fatal firmware error
4              6               Power failure type A
4              7               Power failure type B

7 - Guides

7.1 - Adding a proprietary kernel module to Talos Linux

  1. Patching and building the kernel image

    1. Clone the pkgs repository from GitHub and check out the revision corresponding to your version of Talos Linux

      git clone https://github.com/talos-systems/pkgs pkgs && cd pkgs
      git checkout v0.8.0
      
    2. Clone the Linux kernel and check out the revision that pkgs uses (this can be found in kernel/kernel-prepare/pkg.yaml and it will be something like the following: https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-x.xx.x.tar.xz)

      git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git && cd linux
      git checkout v5.15
      
    3. Your module will need to be converted to be in-tree. The steps for this are different depending on the complexity of the module to port, but generally it would involve moving the module source code into the drivers tree and creating a new Makefile and Kconfig.

    4. Stage your changes in Git with git add -A.

    5. Run git diff --cached --no-prefix > foobar.patch to generate a patch from your changes.

    6. Copy this patch to kernel/kernel/patches in the pkgs repo.

    7. Add a patch line in the prepare segment of kernel/kernel/pkg.yaml:

      patch -p0 < /pkg/patches/foobar.patch
      
    8. Build the kernel image. Make sure you are logged in to ghcr.io before running this command, and you can change or omit PLATFORM depending on what you want to target.

      make kernel PLATFORM=linux/amd64 USERNAME=your-username PUSH=true
      
    9. Make a note of the image name the make command outputs.

  2. Building the installer image

    1. Copy the following into a new Dockerfile:

      FROM scratch AS customization
      COPY --from=ghcr.io/your-username/kernel:<kernel version> /lib/modules /lib/modules
      
      FROM ghcr.io/talos-systems/installer:<talos version>
      COPY --from=ghcr.io/your-username/kernel:<kernel version> /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz
      
    2. Run to build and push the installer:

      INSTALLER_VERSION=<talos version>
      IMAGE_NAME="ghcr.io/your-username/talos-installer:$INSTALLER_VERSION"
      DOCKER_BUILDKIT=0 docker build --build-arg RM="/lib/modules" -t "$IMAGE_NAME" . && docker push "$IMAGE_NAME"
      
  3. Deploying to your cluster

    talosctl upgrade --image ghcr.io/your-username/talos-installer:<talos version> --preserve=true
    

7.2 - Advanced Networking

Static Addressing

Static addressing consists of specifying addresses, routes (remember to add your default gateway), and the interface. Most likely you’ll also want to define the nameservers so you have properly functioning DNS.

machine:
  network:
    hostname: talos
    nameservers:
      - 10.0.0.1
    interfaces:
      - interface: eth0
        addresses:
          - 10.0.0.201/8
        mtu: 8765
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
      - interface: eth1
        ignore: true
  time:
    servers:
      - time.cloudflare.com

Additional Addresses for an Interface

In some environments you may need to set additional addresses on an interface. In the following example, we set two additional addresses on the loopback interface.

machine:
  network:
    interfaces:
      - interface: lo
        addresses:
          - 192.168.0.21/24
          - 10.2.2.2/24

Bonding

The following example shows how to create a bonded interface.

machine:
  network:
    interfaces:
      - interface: bond0
        dhcp: true
        bond:
          mode: 802.3ad
          lacpRate: fast
          xmitHashPolicy: layer3+4
          miimon: 100
          updelay: 200
          downdelay: 200
          interfaces:
            - eth0
            - eth1

VLANs

To set up VLANs on a specific device, use an array of VLANs to add. The master device may be configured without addressing by setting dhcp to false.

machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        vlans:
          - vlanId: 100
            addresses:
              - "192.168.2.10/28"
            routes:
              - network: 0.0.0.0/0
                gateway: 192.168.2.1

7.3 - Air-gapped Environments

In this guide we will create a Talos cluster running in an air-gapped environment with all the required images being pulled from an internal registry. We will use the QEMU provisioner available in talosctl to create a local cluster, but the same approach could be used to deploy Talos in bigger air-gapped networks.

Requirements

The following are requirements for this guide:

  • Docker 18.03 or greater
  • Requirements for the Talos QEMU cluster

Identifying Images

In air-gapped environments, access to the public Internet is restricted, so Talos can’t pull images from public Docker registries (docker.io, ghcr.io, etc.). We need to identify the images required to install and run Talos. The same strategy can be used for images required by custom workloads running on the cluster.

The talosctl images command provides a list of default images used by the Talos cluster (with default configuration settings). To print the list of images, run:

talosctl images

This list contains images required by a default deployment of Talos. There might be additional images required for the workloads running on this cluster, and those should be added to this list.
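
For example, you could capture the default list to a file and append any workload images you need (the nginx image below is purely illustrative), then substitute the file for talosctl images in the loops that follow:

talosctl images > images.txt
echo "docker.io/library/nginx:1.21.6" >> images.txt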

Preparing the Internal Registry

As access to the public registries is restricted, we have to run an internal Docker registry. In this guide, we will launch the registry on the same machine using Docker:

$ docker run -d -p 6000:5000 --restart always --name registry-airgapped registry:2
1bf09802bee1476bc463d972c686f90a64640d87dacce1ac8485585de69c91a5

This registry will be accepting connections on port 6000 on the host IPs. The registry is empty by default, so we have to fill it with the images required by Talos.

First, we pull all the images to our local Docker daemon:

$ for image in `talosctl images`; do docker pull $image; done
v0.12.0-amd64: Pulling from coreos/flannel
Digest: sha256:6d451d92c921f14bfb38196aacb6e506d4593c5b3c9d40a8b8a2506010dc3e10
...

All images are now stored in the Docker daemon store:

$ docker images
ghcr.io/talos-systems/install-cni    v0.3.0-12-g90722c3      980d36ee2ee1        5 days ago          79.7MB
k8s.gcr.io/kube-proxy-amd64          v1.20.0                 33c60812eab8        2 weeks ago         118MB
...

Now we need to re-tag them so that we can push them to our local registry. We are going to replace the first component of the image name (before the first slash) with our registry endpoint 127.0.0.1:6000:

$ for image in `talosctl images`; do \
    docker tag $image `echo $image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
  done

As the next step, we push images to the internal registry:

$ for image in `talosctl images`; do \
    docker push `echo $image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
  done

We can now verify that the images are pushed to the registry:

$ curl  http://127.0.0.1:6000/v2/_catalog
{"repositories":["autonomy/kubelet","coredns","coreos/flannel","etcd-development/etcd","kube-apiserver-amd64","kube-controller-manager-amd64","kube-proxy-amd64","kube-scheduler-amd64","talos-systems/install-cni","talos-systems/installer"]}

Note: images in the registry don’t have the registry endpoint prefix anymore.

Launching Talos in an Air-gapped Environment

For Talos to use the internal registry, we use the registry mirror feature to redirect all the image pull requests to the internal registry. This means that the registry endpoint (as the first component of the image reference) gets ignored, and all pull requests are sent directly to the specified endpoint.

We are going to use a QEMU-based Talos cluster for this guide, but the same approach works with Docker-based clusters as well. As QEMU-based clusters go through the full Talos install process, they better model a real air-gapped environment.

The talosctl cluster create command provides conveniences for common configuration options. The only required flag for this guide is --registry-mirror '*'=http://10.5.0.1:6000, which redirects every pull request to the internal registry. The endpoint used is 10.5.0.1, as this is the default bridge interface address, which is routable from the QEMU VMs (the 127.0.0.1 address would point to the VM itself).

$ sudo -E talosctl cluster create --provisioner=qemu --registry-mirror '*'=http://10.5.0.1:6000 --install-image=ghcr.io/talos-systems/installer:v0.14.0
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/home/smira/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
waiting for API
...

Note: --install-image should match the image which was copied into the internal registry in the previous step.

You can verify that the cluster is air-gapped by inspecting the registry logs: docker logs -f registry-airgapped.

Closing Notes

Running in an air-gapped environment might require additional configuration changes, for example using custom settings for DNS and NTP servers.
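
For example, custom nameservers and NTP servers can be set in the machine configuration (the addresses below are placeholders):

machine:
  network:
    nameservers:
      - 10.0.0.53
  time:
    servers:
      - 10.0.0.123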

When scaling this guide to a bare-metal environment, the following Talos config snippet could be used as an equivalent of the --registry-mirror flag above:

machine:
  ...
  registries:
    mirrors:
      '*':
        endpoints:
          - http://10.5.0.1:6000/
...

Other Docker registry implementations can be used in place of the registry image used above. If required, authentication can be configured for the internal registry (and custom TLS certificates if needed).

7.4 - Configuring Ceph with Rook

Preparation

Talos Linux reserves an entire disk for the OS installation, so machines with multiple available disks are needed for a reliable Ceph cluster with Rook and Talos Linux. Rook requires that the block devices or partitions used by Ceph have no partitions or formatted filesystems before use. Rook also requires a minimum Kubernetes version of v1.16 and Helm v3.0 for installation of charts. It is highly recommended that the Rook Ceph overview is read and understood before deploying a Ceph cluster with Rook.
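
To check which disks are present on a Talos node before deploying Rook, one option (assuming talosctl is already configured for the node) is:

talosctl -n <node IP> disks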

Installation

Creating a Ceph cluster with Rook requires two steps: first, the Rook Operator needs to be installed, which can be done with a Helm Chart. The example below installs the Rook Operator into the rook-ceph namespace, which is the default for a Ceph cluster with Rook.

$ helm repo add rook-release https://charts.rook.io/release
"rook-release" has been added to your repositories

$ helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph
W0327 17:52:44.277830   54987 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0327 17:52:44.612243   54987 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: rook-ceph
LAST DEPLOYED: Sun Mar 27 17:52:42 2022
NAMESPACE: rook-ceph
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Rook Operator has been installed. Check its status by running:
  kubectl --namespace rook-ceph get pods -l "app=rook-ceph-operator"

Visit https://rook.io/docs/rook/latest for instructions on how to create and configure Rook clusters

Important Notes:
- You must customize the 'CephCluster' resource in the sample manifests for your cluster.
- Each CephCluster must be deployed to its own namespace, the samples use `rook-ceph` for the namespace.
- The sample manifests assume you also installed the rook-ceph operator in the `rook-ceph` namespace.
- The helm chart includes all the RBAC required to create a CephCluster CRD in the same namespace.
- Any disk devices you add to the cluster in the 'CephCluster' must be empty (no filesystem and no partitions).

Once that is complete, the Ceph cluster can be installed with the official Helm Chart. The Chart can be installed with default values, which will attempt to use all nodes in the Kubernetes cluster and all unused disks on each node for Ceph storage, and will make block storage, object storage, and a shared filesystem available. Generally a more specific node/device/cluster configuration is used, and the Rook documentation explains all the available options in detail. For this example the defaults will be adequate.

$ helm install --create-namespace --namespace rook-ceph rook-ceph-cluster --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster
NAME: rook-ceph-cluster
LAST DEPLOYED: Sun Mar 27 18:12:46 2022
NAMESPACE: rook-ceph
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Ceph Cluster has been installed. Check its status by running:
  kubectl --namespace rook-ceph get cephcluster

Visit https://rook.github.io/docs/rook/latest/ceph-cluster-crd.html for more information about the Ceph CRD.

Important Notes:
- You can only deploy a single cluster per namespace
- If you wish to delete this cluster and start fresh, you will also have to wipe the OSD disks using `sfdisk`

Now that the Ceph cluster configuration has been created, the Rook operator needs time to install the Ceph cluster and bring all of its components online. The progression of the Ceph cluster state can be followed with the following command.

$ watch kubectl --namespace rook-ceph get cephcluster rook-ceph
Every 2.0s: kubectl --namespace rook-ceph get cephcluster rook-ceph

NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE         MESSAGE                 HEALTH   EXTERNAL
rook-ceph   /var/lib/rook     3          57s   Progressing   Configuring Ceph Mons

Depending on the size of the Ceph cluster and the availability of resources, the Ceph cluster should become available after some time, and with it the storage classes that can be used with Kubernetes Persistent Volumes.

$ kubectl --namespace rook-ceph get cephcluster rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
rook-ceph   /var/lib/rook     3          40m   Ready   Cluster created successfully   HEALTH_OK

$ kubectl  get storageclass
NAME                   PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-block (default)   rook-ceph.rbd.csi.ceph.com      Delete          Immediate           true                   77m
ceph-bucket            rook-ceph.ceph.rook.io/bucket   Delete          Immediate           false                  77m
ceph-filesystem        rook-ceph.cephfs.csi.ceph.com   Delete          Immediate           true                   77m

Talos Linux Considerations

It is important to note that a Rook Ceph cluster saves cluster information directly onto the node (by default dataDirHostPath is set to /var/lib/rook), which under Talos Linux is ephemeral. This makes cluster management a little bit more involved, as any time a Talos Linux node is reconfigured or upgraded, the ephemeral partition is wiped.

When performing maintenance on a Talos Linux node with a Rook Ceph cluster, it is imperative that care be taken to maintain the health of the Ceph cluster, for instance when upgrading the Talos Linux version. Before upgrading, you should always check the health status of the Ceph cluster to ensure that it is healthy.

$ kubectl --namespace rook-ceph get cephclusters.ceph.rook.io rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
rook-ceph   /var/lib/rook     3          98m   Ready   Cluster created successfully   HEALTH_OK

If it is, you can begin the upgrade process for the Talos Linux node, during which time the Ceph cluster will become unhealthy as the node is reconfigured. Before performing any other action on the Talos Linux nodes, the Ceph cluster must return to a healthy status.

$ talosctl upgrade --nodes 172.20.15.5 --image ghcr.io/talos-systems/installer:v0.14.3
NODE          ACK                        STARTED
172.20.15.5   Upgrade request received   2022-03-27 20:29:55.292432887 +0200 CEST m=+10.050399758

$ kubectl --namespace rook-ceph get cephclusters.ceph.rook.io
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE         MESSAGE                   HEALTH        EXTERNAL
rook-ceph   /var/lib/rook     3          99m   Progressing   Configuring Ceph Mgr(s)   HEALTH_WARN

$ kubectl --namespace rook-ceph wait --timeout=1800s --for=jsonpath='{.status.ceph.health}=HEALTH_OK' cephclusters.ceph.rook.io rook-ceph
cephcluster.ceph.rook.io/rook-ceph condition met

The above steps need to be performed for each Talos Linux node undergoing maintenance, one at a time.

Cleaning Up

Rook Ceph Cluster Removal

Removing a Rook Ceph cluster requires a few steps, starting with signalling to Rook that the Ceph cluster is really being destroyed. Then all Persistent Volumes (and Claims) backed by the Ceph cluster must be deleted, followed by the Storage Classes and the Ceph storage types.

$ kubectl --namespace rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}'
cephcluster.ceph.rook.io/rook-ceph patched

$ kubectl delete storageclasses ceph-block ceph-bucket ceph-filesystem
storageclass.storage.k8s.io "ceph-block" deleted
storageclass.storage.k8s.io "ceph-bucket" deleted
storageclass.storage.k8s.io "ceph-filesystem" deleted

$ kubectl --namespace rook-ceph delete cephblockpools ceph-blockpool
cephblockpool.ceph.rook.io "ceph-blockpool" deleted

$ kubectl --namespace rook-ceph delete cephobjectstore ceph-objectstore
cephobjectstore.ceph.rook.io "ceph-objectstore" deleted

$ kubectl --namespace rook-ceph delete cephfilesystem ceph-filesystem
cephfilesystem.ceph.rook.io "ceph-filesystem" deleted

Once that is complete, the Ceph cluster itself can be removed, along with the Rook Ceph cluster Helm chart installation.

$ kubectl --namespace rook-ceph delete cephcluster rook-ceph
cephcluster.ceph.rook.io "rook-ceph" deleted

$ helm --namespace rook-ceph uninstall rook-ceph-cluster
release "rook-ceph-cluster" uninstalled

If needed, the Rook Operator can also be removed along with all the Custom Resource Definitions that it created.

$ helm --namespace rook-ceph uninstall rook-ceph
W0328 12:41:14.998307  147203 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
These resources were kept due to the resource policy:
[CustomResourceDefinition] cephblockpools.ceph.rook.io
[CustomResourceDefinition] cephbucketnotifications.ceph.rook.io
[CustomResourceDefinition] cephbuckettopics.ceph.rook.io
[CustomResourceDefinition] cephclients.ceph.rook.io
[CustomResourceDefinition] cephclusters.ceph.rook.io
[CustomResourceDefinition] cephfilesystemmirrors.ceph.rook.io
[CustomResourceDefinition] cephfilesystems.ceph.rook.io
[CustomResourceDefinition] cephfilesystemsubvolumegroups.ceph.rook.io
[CustomResourceDefinition] cephnfses.ceph.rook.io
[CustomResourceDefinition] cephobjectrealms.ceph.rook.io
[CustomResourceDefinition] cephobjectstores.ceph.rook.io
[CustomResourceDefinition] cephobjectstoreusers.ceph.rook.io
[CustomResourceDefinition] cephobjectzonegroups.ceph.rook.io
[CustomResourceDefinition] cephobjectzones.ceph.rook.io
[CustomResourceDefinition] cephrbdmirrors.ceph.rook.io
[CustomResourceDefinition] objectbucketclaims.objectbucket.io
[CustomResourceDefinition] objectbuckets.objectbucket.io

release "rook-ceph" uninstalled

$ kubectl delete crds cephblockpools.ceph.rook.io cephbucketnotifications.ceph.rook.io cephbuckettopics.ceph.rook.io \
                      cephclients.ceph.rook.io cephclusters.ceph.rook.io cephfilesystemmirrors.ceph.rook.io \
                      cephfilesystems.ceph.rook.io cephfilesystemsubvolumegroups.ceph.rook.io \
                      cephnfses.ceph.rook.io cephobjectrealms.ceph.rook.io cephobjectstores.ceph.rook.io \
                      cephobjectstoreusers.ceph.rook.io cephobjectzonegroups.ceph.rook.io cephobjectzones.ceph.rook.io \
                      cephrbdmirrors.ceph.rook.io objectbucketclaims.objectbucket.io objectbuckets.objectbucket.io
customresourcedefinition.apiextensions.k8s.io "cephblockpools.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephbucketnotifications.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephbuckettopics.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephclients.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephclusters.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephfilesystemmirrors.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephfilesystems.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephfilesystemsubvolumegroups.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephnfses.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectrealms.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectstores.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectstoreusers.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectzonegroups.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectzones.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephrbdmirrors.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "objectbucketclaims.objectbucket.io" deleted
customresourcedefinition.apiextensions.k8s.io "objectbuckets.objectbucket.io" deleted

Talos Linux Rook Metadata Removal

If the Rook Operator is cleanly removed following the above process, the node metadata and disks should be clean and ready to be re-used. In the case of an unclean cluster removal, there may still be a few instances of metadata stored on the system disk, as well as partition information on the storage disks. First, the node metadata needs to be removed; make sure to update nodeName with the actual name of a storage node that needs cleaning, and path with the dataDirHostPath set when installing the chart. The following will need to be repeated for each node used in the Rook Ceph cluster.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-clean
spec:
  restartPolicy: Never
  nodeName: <storage-node-name>
  volumes:
  - name: rook-data-dir
    hostPath:
      path: <dataDirHostPath>
  containers:
  - name: disk-clean
    image: busybox
    securityContext:
      privileged: true
    volumeMounts:
    - name: rook-data-dir
      mountPath: /node/rook-data
    command: ["/bin/sh", "-c", "rm -rf /node/rook-data/*"]
EOF
pod/disk-clean created

$ kubectl wait --timeout=900s --for=jsonpath='{.status.phase}=Succeeded' pod disk-clean
pod/disk-clean condition met

$ kubectl delete pod disk-clean
pod "disk-clean" deleted

Lastly, the disks themselves need the partition and filesystem data wiped before they can be reused. Again, the following has to be repeated for each node and disk used in the Rook Ceph cluster, updating nodeName and of= in the command as needed.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-wipe
spec:
  restartPolicy: Never
  nodeName: <storage-node-name>
  containers:
  - name: disk-wipe
    image: busybox
    securityContext:
      privileged: true
    command: ["/bin/sh", "-c", "dd if=/dev/zero bs=1M count=100 oflag=direct of=<device>"]
EOF
pod/disk-wipe created

$ kubectl wait --timeout=900s --for=jsonpath='{.status.phase}=Succeeded' pod disk-wipe
pod/disk-wipe condition met

$ kubectl delete pod disk-wipe
pod "disk-wipe" deleted

7.5 - Configuring Certificate Authorities

Appending the Certificate Authority

Put into each machine the PEM encoded certificate:

machine:
  ...
  files:
    - content: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----        
      permissions: 0644
      path: /etc/ssl/certs/ca-certificates
      op: append

7.6 - Configuring Containerd

The base containerd configuration expects to merge in any additional configs present in /var/cri/conf.d/*.toml.

An example of exposing metrics

Into each machine config, add the following:

machine:
  ...
  files:
    - content: |
        [metrics]
          address = "0.0.0.0:11234"        
      path: /var/cri/conf.d/metrics.toml
      op: create

Create the cluster as usual and verify that metrics are now exposed on this port:

$ curl 127.0.0.1:11234/v1/metrics
# HELP container_blkio_io_service_bytes_recursive_bytes The blkio io service bytes recursive
# TYPE container_blkio_io_service_bytes_recursive_bytes gauge
container_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Async"} 0
container_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Discard"} 0
...
...

7.7 - Configuring Corporate Proxies

Appending the Certificate Authority of MITM Proxies

Put into each machine the PEM encoded certificate:

machine:
  ...
  files:
    - content: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----        
      permissions: 0644
      path: /etc/ssl/certs/ca-certificates
      op: append

Configuring a Machine to Use the Proxy

To make use of a proxy:

machine:
  env:
    http_proxy: <http proxy>
    https_proxy: <https proxy>
    no_proxy: <no proxy>

Additionally, configure the DNS nameservers and NTP servers:

machine:
  env:
  ...
  time:
    servers:
      - <server 1>
      - <server ...>
      - <server n>
  ...
  network:
    nameservers:
      - <ip 1>
      - <ip ...>
      - <ip n>

7.8 - Configuring Network Connectivity

Configuring Network Connectivity

The simplest way to deploy Talos is by ensuring that all the remote components of the system (talosctl, the control plane nodes, and worker nodes) all have layer 2 connectivity. This is not always possible, however, so this page lays out the minimal network access that is required to configure and operate a Talos cluster.

Note: These are the ports required for Talos specifically, and should be configured in addition to the ports required by Kubernetes. See the Kubernetes docs for information on the ports used by Kubernetes itself.

Control plane node(s)

Protocol   Direction   Port Range   Purpose   Used By
TCP        Inbound     50000*       apid      talosctl
TCP        Inbound     50001*       trustd    Control plane nodes, worker nodes

Ports marked with a * are not currently configurable, but that may change in the future. Follow along here.

Worker node(s)

Protocol   Direction   Port Range   Purpose   Used By
TCP        Inbound     50001*       trustd    Control plane nodes

Ports marked with a * are not currently configurable, but that may change in the future. Follow along here.
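
A quick way to confirm from the machine running talosctl that the apid port on a control plane node is reachable (assuming netcat is installed) is:

nc -zv <control plane IP> 50000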

7.9 - Configuring Pull Through Cache

In this guide we will create a set of local caching Docker registry proxies to minimize local cluster startup time.

When running Talos locally, pulling images from Docker registries might take a significant amount of time. We spin up local caching pass-through registries to cache images and configure a local Talos cluster to use those proxies. A similar approach might be used to run Talos in production in air-gapped environments. It can also be used to verify that all the images are available in local registries.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Requirements

The following are requirements for creating the set of caching proxies:

  • Docker 18.03 or greater
  • Local cluster requirements for either docker or QEMU.

Launch the Caching Docker Registry Proxies

Talos pulls from docker.io, k8s.gcr.io, quay.io, gcr.io, and ghcr.io by default. If your configuration is different, you might need to modify the commands below:

docker run -d -p 5000:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    --restart always \
    --name registry-docker.io registry:2

docker run -d -p 5001:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://k8s.gcr.io \
    --restart always \
    --name registry-k8s.gcr.io registry:2

docker run -d -p 5002:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://quay.io \
    --restart always \
    --name registry-quay.io registry:2.5

docker run -d -p 5003:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://gcr.io \
    --restart always \
    --name registry-gcr.io registry:2

docker run -d -p 5004:5000 \
    -e REGISTRY_PROXY_REMOTEURL=https://ghcr.io \
    --restart always \
    --name registry-ghcr.io registry:2

Note: Proxies are started as Docker containers, and they’re automatically configured to start with the Docker daemon. Please note that the quay.io proxy doesn’t support the recent Docker image schema, so we run an older registry image version (2.5).

As a registry container can only handle a single upstream Docker registry, we launch a container per upstream, each on its own host port (5000, 5001, 5002, 5003 and 5004).
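
To confirm that all five proxy containers are running, list them:

docker ps --filter name=registry-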

Using Caching Registries with QEMU Local Cluster

With a QEMU local cluster, a bridge interface is created on the host. As registry containers expose their ports on the host, we can use bridge IP to direct proxy requests.

sudo talosctl cluster create --provisioner qemu \
    --registry-mirror docker.io=http://10.5.0.1:5000 \
    --registry-mirror k8s.gcr.io=http://10.5.0.1:5001 \
    --registry-mirror quay.io=http://10.5.0.1:5002 \
    --registry-mirror gcr.io=http://10.5.0.1:5003 \
    --registry-mirror ghcr.io=http://10.5.0.1:5004

The Talos local cluster should now start pulling via the caching registries. This can be verified via the registry logs, e.g. docker logs -f registry-docker.io. The first time the cluster boots, images are pulled and cached, so the next cluster boot should be much faster.

Note: 10.5.0.1 is the bridge IP with the default network (10.5.0.0/24); if using a custom --cidr, the value should be adjusted accordingly.

Using Caching Registries with docker Local Cluster

With a Docker local cluster we can use the Docker bridge IP; the default value for that IP is 172.17.0.1. On Linux, the Docker bridge address can be inspected with ip addr show docker0.

talosctl cluster create --provisioner docker \
    --registry-mirror docker.io=http://172.17.0.1:5000 \
    --registry-mirror k8s.gcr.io=http://172.17.0.1:5001 \
    --registry-mirror quay.io=http://172.17.0.1:5002 \
    --registry-mirror gcr.io=http://172.17.0.1:5003 \
    --registry-mirror ghcr.io=http://172.17.0.1:5004

Cleaning Up

To cleanup, run:

docker rm -f registry-docker.io
docker rm -f registry-k8s.gcr.io
docker rm -f registry-quay.io
docker rm -f registry-gcr.io
docker rm -f registry-ghcr.io

Note: Removing the Docker registry containers also removes the image cache, so if you plan to keep using caching registries, keep the containers running.

7.10 - Configuring the Cluster Endpoint

In this section, we will step through the configuration of a Talos based Kubernetes cluster. There are three major components we will configure:

  • apid and talosctl
  • the master nodes
  • the worker nodes

Talos enforces a high level of security by using mutual TLS for authentication and authorization.

We recommend that the configuration of Talos be performed by a cluster owner. A cluster owner should be a person of authority within an organization, perhaps a director, manager, or senior member of a team. They are responsible for storing the root CA, and distributing the PKI for authorized cluster administrators.

Talos runs great out of the box, but tweaking some minor settings will make your life a lot easier in the future. This is not a requirement, but rather a document explaining some key settings.

Endpoint

To configure the talosctl endpoint, it is recommended you use a resolvable DNS name. This way, if you decide to upgrade to a multi-controlplane cluster you only have to add the IP address to the hostname configuration. The configuration can either be done on a load balancer, or simply through DNS.

For example:

This is set in the config files for the cluster, e.g. controlplane.yaml and worker.yaml. For more details, please see the v1alpha1 endpoint configuration.

.....
cluster:
  controlPlane:
    endpoint: https://endpoint.example.local:6443
.....

If you use a DNS name as the endpoint, you can upgrade your Talos cluster to multiple controlplanes in the future (if you don’t have a multi-controlplane setup from the start). Using a DNS name generates the corresponding certificates (Kubernetes and Talos) for the correct hostname.

7.11 - Configuring Wireguard Network

In this guide you will learn how to set up a Wireguard network using the kernel module.

Configuring Wireguard Network

Quick Start

The quickest way to try out Wireguard is to use the talosctl cluster create command:

talosctl cluster create --wireguard-cidr 10.1.0.0/24

It will automatically generate a Wireguard network configuration for each node with the following topology: all controlplane nodes are used as Wireguard servers listening on port 51111, and all controlplanes and workers connect to all controlplanes. It also sets PersistentKeepalive to 5 seconds to keep the controlplane-to-worker connections established.

After the cluster is deployed it should be possible to verify Wireguard network connectivity. One option is to deploy a container with hostNetwork enabled (see the example pod manifest after the wg show output below), open a shell in it with kubectl exec -it <container> -- /bin/bash, and either run:

ping 10.1.0.2

Or install wireguard-tools package and run:

wg show

wg show should output something like this:

interface: wg0
  public key: OMhgEvNIaEN7zeCLijRh4c+0Hwh3erjknzdyvVlrkGM=
  private key: (hidden)
  listening port: 47946

peer: 1EsxUygZo8/URWs18tqB5FW2cLVlaTA+lUisKIf8nh4=
  endpoint: 10.5.0.2:51111
  allowed ips: 10.1.0.0/24
  latest handshake: 1 minute, 55 seconds ago
  transfer: 3.17 KiB received, 3.55 KiB sent
  persistent keepalive: every 5 seconds
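
For convenience, the checks above can be run from a minimal pod with hostNetwork enabled; a sketch of such a manifest (the pod name and image are illustrative) might look like:

apiVersion: v1
kind: Pod
metadata:
  name: wg-check
spec:
  hostNetwork: true
  restartPolicy: Never
  containers:
    - name: wg-check
      image: alpine
      command: ["/bin/sh", "-c", "sleep 3600"]

Once it is running, kubectl exec -it wg-check -- /bin/sh gives a shell with access to the node’s network namespace.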

It is also possible to use generated configuration as a reference by pulling generated config files using:

talosctl read -n 10.5.0.2 /system/state/config.yaml > controlplane.yaml
talosctl read -n 10.5.0.3 /system/state/config.yaml > worker.yaml

Manual Configuration

All Wireguard configuration can be done by changing Talos machine config files. As an example we will use this official Wireguard quick start tutorial.

Key Generation

This part is exactly the same:

wg genkey | tee privatekey | wg pubkey > publickey

Setting up Device

Inline comments show relations between configs and wg quickstart tutorial commands:

...
network:
  interfaces:
    ...
      # ip link add dev wg0 type wireguard
    - interface: wg0
      mtu: 1500
      # ip address add dev wg0 192.168.2.1/24
      addresses:
        - 192.168.2.1/24
      # wg set wg0 listen-port 51820 private-key /path/to/private-key peer ABCDEF... allowed-ips 192.168.88.0/24 endpoint 209.202.254.14:8172
      wireguard:
        privateKey: <privatekey file contents>
        listenPort: 51820
        peers:
          - allowedIPs:
              - 192.168.88.0/24
            endpoint: 209.202.254.14:8172
            publicKey: ABCDEF...
...

When networkd gets this configuration it will create the device, configure it, and bring it up (equivalent to ip link set up dev wg0).
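
To verify from outside the node that the interface came up, the Talos network resources can be queried (assuming a Talos version that exposes them), for example:

talosctl -n <node IP> get links
talosctl -n <node IP> get addresses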

All supported config parameters are described in the Machine Config Reference.

7.12 - Customizing the Kernel

The installer image contains ONBUILD instructions that handle the following:

  • the decompression, and unpacking of the initramfs.xz
  • the unsquashing of the rootfs
  • the copying of new rootfs files
  • the squashing of the new rootfs
  • and the packing, and compression of the new initramfs.xz

When used as a base image, the installer will perform the above steps automatically with the requirement that a customization stage be defined in the Dockerfile.

Build and push your own kernel:

git clone https://github.com/talos-systems/pkgs.git
cd pkgs
make kernel-menuconfig USERNAME=_your_github_user_name_

docker login ghcr.io --username _your_github_user_name_
make kernel USERNAME=_your_github_user_name_ PUSH=true

Using a multi-stage Dockerfile we can define the customization stage and build FROM the installer image:

FROM scratch AS customization
COPY --from=<custom kernel image> /lib/modules /lib/modules

FROM ghcr.io/talos-systems/installer:latest
COPY --from=<custom kernel image> /boot/vmlinuz /usr/install/${TARGETARCH}/vmlinuz

When building the image, the customization stage will automatically be copied into the rootfs. The customization stage is not limited to a single COPY instruction. In fact, you can do whatever you would like in this stage, but keep in mind that everything in / will be copied into the rootfs.

To build the image, run:

DOCKER_BUILDKIT=0 docker build --build-arg RM="/lib/modules" -t installer:kernel .

Note: BuildKit has a bug (#816); to disable it, use DOCKER_BUILDKIT=0.

Now that we have a custom installer we can build Talos for the specific platform we wish to deploy to.

7.13 - Customizing the Root Filesystem

The installer image contains ONBUILD instructions that handle the following:

  • the decompression, and unpacking of the initramfs.xz
  • the unsquashing of the rootfs
  • the copying of new rootfs files
  • the squashing of the new rootfs
  • and the packing, and compression of the new initramfs.xz

When used as a base image, the installer will perform the above steps automatically with the requirement that a customization stage be defined in the Dockerfile.

For example, say we have an image that contains the contents of a library we wish to add to the Talos rootfs. We need to define a stage with the name customization:

FROM scratch AS customization
COPY --from=<name|index> <src> <dest>

Using a multi-stage Dockerfile we can define the customization stage and build FROM the installer image:

FROM scratch AS customization
COPY --from=<name|index> <src> <dest>

FROM ghcr.io/talos-systems/installer:latest

When building the image, the customization stage will automatically be copied into the rootfs. The customization stage is not limited to a single COPY instruction. In fact, you can do whatever you would like in this stage, but keep in mind that everything in / will be copied into the rootfs.

Note: <dest> is the path relative to the rootfs that you wish to place the contents of <src>.

To build the image, run:

docker build --squash -t <organization>/installer:latest .

In the case that you need to perform some cleanup before adding additional files to the rootfs, you can specify the RM build-time variable:

docker build --squash --build-arg RM="[<path> ...]" -t <organization>/installer:latest .

This will perform a rm -rf on the specified paths relative to the rootfs.

Note: RM must be a whitespace delimited list.
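
For example, to clean up two paths in a single build (the paths are illustrative):

docker build --squash --build-arg RM="/lib/modules /usr/share/doc" -t <organization>/installer:latest .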

The resulting image can be used to:

  • generate an image for any of the supported providers
  • perform bare-metal installs
  • perform upgrades

We will step through common customizations in the remainder of this section.

7.14 - Deploying Cilium CNI

In this guide you will learn how to set up Cilium CNI on Talos.

From v1.9 onwards, Cilium no longer provides a one-liner install manifest that can be used to install Cilium on a node via kubectl apply -f or by passing it in as an extra URL in the urls part of the Talos machine configuration.

Installing Cilium the new way via the cilium CLI is broken, so we’ll be using Helm to install Cilium. For more information, see: Install with CLI fails, works with Helm

Refer to Installing with Helm for more information.

First we’ll need to add the helm repo for Cilium.

helm repo add cilium https://helm.cilium.io/
helm repo update

This documentation will outline installing Cilium CNI v1.11.2 on Talos in four different ways. Adhering to Talos principles, we’ll deploy Cilium with its IPAM mode set to Kubernetes. Each method can install Cilium either with kube-proxy (the default) or without it: Kubernetes Without kube-proxy

Machine config preparation

When generating the machine config for a node set the CNI to none. For example using a config patch:

talosctl gen config \
    my-cluster https://mycluster.local:6443 \
    --config-patch '[{"op":"add", "path": "/cluster/network/cni", "value": {"name": "none"}}]'

Or if you want to deploy Cilium in strict mode without kube-proxy, you also need to disable kube-proxy:

talosctl gen config \
    my-cluster https://mycluster.local:6443 \
    --config-patch '[{"op": "add", "path": "/cluster/proxy", "value": {"disabled": true}}, {"op":"add", "path": "/cluster/network/cni", "value": {"name": "none"}}]'

Method 1: Helm install

After applying the machine config and bootstrapping, Talos will appear to hang on phase 18/19 with the message: retrying error: node not ready. This happens because nodes in Kubernetes are only marked as ready once the CNI is up. As there is no CNI defined, the boot process is pending and will reboot the node to retry after 10 minutes; this is expected behavior.

During this window you can install Cilium manually by running the following:

helm install cilium cilium/cilium \
    --version 1.11.2 \
    --namespace kube-system \
    --set ipam.mode=kubernetes

Or if you want to deploy Cilium in strict mode without kube-proxy, also set some extra parameters:

export KUBERNETES_API_SERVER_ADDRESS=<>
export KUBERNETES_API_SERVER_PORT=6443

helm install cilium cilium/cilium \
    --version 1.11.2 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost="${KUBERNETES_API_SERVER_ADDRESS}" \
    --set k8sServicePort="${KUBERNETES_API_SERVER_PORT}"

After Cilium is installed the boot process should continue and complete successfully.

Method 2: Helm manifests install

Instead of directly installing Cilium you can instead first generate the manifest and then apply it:

helm template cilium cilium/cilium \
    --version 1.11.2 \
    --namespace kube-system \
    --set ipam.mode=kubernetes > cilium.yaml

kubectl apply -f cilium.yaml

Without kube-proxy:

export KUBERNETES_API_SERVER_ADDRESS=<>
export KUBERNETES_API_SERVER_PORT=6443

helm template cilium cilium/cilium \
    --version 1.11.2 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=strict \
    --set k8sServiceHost="${KUBERNETES_API_SERVER_ADDRESS}" \
    --set k8sServicePort="${KUBERNETES_API_SERVER_PORT}" > cilium.yaml

kubectl apply -f cilium.yaml

Method 3: Helm manifests hosted install

After generating cilium.yaml using helm template, instead of applying this manifest directly during the Talos boot window (before the reboot timeout), you can host this file somewhere and patch the machine config to apply this manifest automatically during bootstrap. To do this, patch your machine configuration to include this config instead of the above:

talosctl gen config \
    my-cluster https://mycluster.local:6443 \
    --config-patch '[{"op":"add", "path": "/cluster/network/cni", "value": {"name": "custom", "urls": ["https://server.yourdomain.tld/some/path/cilium.yaml"]}}]'

Resulting in a config that looks like this:

name: custom # Name of CNI to use.
# URLs containing manifests to apply for the CNI.
urls:
    - https://server.yourdomain.tld/some/path/cilium.yaml

However, beware that the Helm-generated Cilium manifest contains sensitive key material. As such, you should definitely not host it anywhere publicly accessible.

Method 4: Helm manifests inline install

A more secure option would be to include the helm template output manifest inside the machine configuration. The machine config should be generated with CNI set to none:

talosctl gen config \
    my-cluster https://mycluster.local:6443 \
    --config-patch '[{"op":"add", "path": "/cluster/network/cni", "value": {"name": "none"}}]'

If deploying Cilium with kube-proxy disabled, you can instead generate the config with the following:

talosctl gen config \
    my-cluster https://mycluster.local:6443 \
    --config-patch '[{"op": "add", "path": "/cluster/proxy", "value": {"disabled": true}}, {"op":"add", "path": "/cluster/network/cni", "value": {"name": "none"}}]'

To do so, patch this into your machine configuration:

inlineManifests:
    - name: cilium
      contents: |
        ---
        # Source: cilium/templates/cilium-agent/serviceaccount.yaml
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: "cilium"
          namespace: kube-system
        ---
        # Source: cilium/templates/cilium-operator/serviceaccount.yaml
        apiVersion: v1
        kind: ServiceAccount
        -> Your cilium.yaml file will be pretty long....        

This will install the Cilium manifests at just the right time during bootstrap.

Beware though:

  • Changing the namespace when templating with Helm does not generate a manifest containing the yaml to create that namespace. As the inline manifest is processed from top to bottom make sure to manually put the namespace yaml at the start of the inline manifest.
  • Only add the Cilium inline manifest to the control plane nodes machine configuration.
  • Make sure all control plane nodes have an identical configuration.
  • If you delete any of the generated resources they will be restored whenever a control plane node reboots.
  • As a safety measure Talos only creates missing resources from inline manifests, it never deletes or updates anything.
  • If you need to update a manifest make sure to first edit all control plane machine configurations and then run talosctl upgrade-k8s as it will take care of updating inline manifests.

Known issues

Other things to know

  • Talos has full kernel module support for eBPF, See:

  • Talos also includes the modules:

    • CONFIG_NETFILTER_XT_TARGET_TPROXY=m
    • CONFIG_NETFILTER_XT_TARGET_CT=m
    • CONFIG_NETFILTER_XT_MATCH_MARK=m
    • CONFIG_NETFILTER_XT_MATCH_SOCKET=m

    This allows you to set --set enableXTSocketFallback=false on the helm install/template command, preventing Cilium from disabling the ip_early_demux kernel feature. This will win back some performance.

7.15 - Deploying Metrics Server

In this guide you will learn how to set up metrics-server.

Metrics Server enables use of the Horizontal Pod Autoscaler and Vertical Pod Autoscaler. It does this by gathering metrics data from the kubelets in a cluster. By default, the certificates in use by the kubelets will not be recognized by metrics-server. This can be solved by either configuring metrics-server to do no validation of the TLS certificates, or by modifying the kubelet configuration to rotate its certificates and use ones that will be recognized by metrics-server.

Node Configuration

To enable kubelet certificate rotation, all nodes should have the following Machine Config snippet:

machine:
  kubelet:
    extraArgs:
      rotate-server-certificates: true

Install During Bootstrap

We will want to ensure that new certificates for the kubelets are approved automatically. This can easily be done with the Kubelet Serving Certificate Approver, which will automatically approve the Certificate Signing Requests generated by the kubelets.

We can have Kubelet Serving Certificate Approver and metrics-server installed on the cluster automatically during bootstrap by adding the following snippet to the Cluster Config of the node that will be handling the bootstrap process:

cluster:
  extraManifests:
    - https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
    - https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Install After Bootstrap

If you choose not to use extraManifests to install Kubelet Serving Certificate Approver and metrics-server during bootstrap, you can install them once the cluster is online using kubectl:

kubectl apply -f https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
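
Once the metrics-server pod is running and the kubelet serving certificates have been approved, metrics collection can be verified (it may take a minute or two after installation):

kubectl top nodes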

7.16 - Disaster Recovery

Procedure for snapshotting etcd database and recovering from catastrophic control plane failure.

The etcd database backs the Kubernetes control plane state, so if the etcd service is unavailable, the Kubernetes control plane goes down and the cluster is not recoverable until etcd is recovered with its contents. The etcd consistency model is built around the Raft consensus protocol, so for highly-available control plane clusters, loss of one control plane node doesn’t impact cluster health. In general, etcd stays up as long as enough nodes are up to maintain quorum. For a three control plane node Talos cluster, this means that the cluster tolerates a failure of any single node, but losing more than one node at the same time leads to complete loss of service. Because of that, it is important to take routine backups of etcd state so that a snapshot is available to recover the cluster from in case of catastrophic failure.

Backup

Snapshotting etcd Database

Create a consistent snapshot of etcd database with talosctl etcd snapshot command:

$ talosctl -n <IP> etcd snapshot db.snapshot
etcd snapshot saved to "db.snapshot" (2015264 bytes)
snapshot info: hash c25fd181, revision 4193, total keys 1287, total size 3035136

Note: filename db.snapshot is arbitrary.

This database snapshot can be taken on any healthy control plane node (with IP address <IP> in the example above), as all etcd instances contain exactly the same data. It is recommended to create etcd snapshots on a schedule to allow point-in-time recovery using the latest snapshot.
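
As one possible approach (a sketch only, assuming talosctl and a valid talosconfig are available on the machine running cron; the schedule, node address, and paths are placeholders), a cron entry can save a timestamped snapshot daily:

# /etc/cron.d/talos-etcd-snapshot: take a snapshot every day at 03:00
0 3 * * * root talosctl -n <IP> etcd snapshot /var/backups/etcd/db-$(date +\%Y\%m\%d).snapshot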

Disaster Database Snapshot

If the etcd cluster is not healthy, the talosctl etcd snapshot command might fail. In that case, copy the database snapshot directly from the control plane node:

talosctl -n <IP> cp /var/lib/etcd/member/snap/db .

This snapshot might not be fully consistent (if the etcd process is running), but it allows for disaster recovery when the latest regular snapshot is not available.

Machine Configuration

Machine configuration might be required to recover the node after hardware failure. Back up the Talos node machine configuration with the following command:

talosctl -n <IP> get mc v1alpha1 -o yaml | yq eval '.spec' -

Recovery

Before starting a disaster recovery procedure, make sure that etcd cluster can’t be recovered:

  • get the etcd cluster member list on all healthy control plane nodes with the talosctl -n <IP> etcd members command and compare across all members.
  • query etcd health across control plane nodes with talosctl -n <IP> service etcd.

If the quorum can be restored, restoring quorum might be a better strategy than performing the full disaster recovery procedure.

Latest Etcd Snapshot

Get hold of the latest etcd database snapshot. If a snapshot is not fresh enough, create a database snapshot (see above), even if the etcd cluster is unhealthy.

Init Node

Make sure that there are no control plane nodes with machine type init:

$ talosctl -n <IP1>,<IP2>,... get machinetype
NODE         NAMESPACE   TYPE          ID             VERSION   TYPE
172.20.0.2   config      MachineType   machine-type   2         controlplane
172.20.0.4   config      MachineType   machine-type   2         controlplane
172.20.0.3   config      MachineType   machine-type   2         controlplane

Nodes with the init type are incompatible with the etcd recovery procedure. An init node can be converted to the controlplane type with the talosctl edit mc --on-reboot command, followed by a node reboot with the talosctl reboot command.

Preparing Control Plane Nodes

If some control plane nodes experienced hardware failure, replace them with new nodes. Use machine configuration backup to re-create the nodes with the same secret material and control plane settings to allow workers to join the recovered control plane.

If a control plane node is healthy but etcd isn’t, wipe the node’s EPHEMERAL partition to remove the etcd data directory (make sure a database snapshot is taken before doing this):

talosctl -n <IP> reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL

At this point, all control plane nodes should boot up, and etcd service should be in the Preparing state.

Kubernetes control plane endpoint should be pointed to the new control plane nodes if there were any changes to the node addresses.

Recovering from the Backup

Make sure all etcd service instances are in Preparing state:

$ talosctl -n <IP> service etcd
NODE     172.20.0.2
ID       etcd
STATE    Preparing
HEALTH   ?
EVENTS   [Preparing]: Running pre state (17s ago)
         [Waiting]: Waiting for service "cri" to be "up", time sync (18s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", time sync (20s ago)

Execute the bootstrap command against any control plane node passing the path to the etcd database snapshot:

$ talosctl -n <IP> bootstrap --recover-from=./db.snapshot
recovering from snapshot "./db.snapshot": hash c25fd181, revision 4193, total keys 1287, total size 3035136

Note: if database snapshot was copied out directly from the etcd data directory using talosctl cp, add flag --recover-skip-hash-check to skip integrity check on restore.

Talos node should print matching information in the kernel log:

recovering etcd from snapshot: hash c25fd181, revision 4193, total keys 1287, total size 3035136
{"level":"info","msg":"restoring snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/li}
{"level":"info","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":3360}
{"level":"info","msg":"added member","cluster-id":"a3390e43eb5274e2","local-member-id":"0","added-peer-id":"eb4f6f534361855e","added-peer-peer-urls":["https:/}
{"level":"info","msg":"restored snapshot","path":"/var/lib/etcd.snapshot","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}

Now the etcd service should become healthy on the bootstrap node, Kubernetes control plane components should start, and the control plane endpoint should become available. The remaining control plane nodes join the etcd cluster once the control plane endpoint is up.

Single Control Plane Node Cluster

This guide applies to the single control plane clusters as well. In fact, it is much more important to take regular snapshots of the etcd database in single control plane node case, as loss of the control plane node might render the whole cluster irrecoverable without a backup.

7.17 - Discovery

Video Walkthrough

To see a live demo of Cluster Discovery, see the video below:

Registries

Peers are aggregated from a number of optional registries. By default, Talos will use the kubernetes and service registries. Either one can be disabled. To disable a registry, set disabled to true (this option is the same for all registries). For example, to disable the service registry:

cluster:
  discovery:
    enabled: true
    registries:
      service:
        disabled: true

Disabling all registries effectively disables member discovery altogether.
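
For example, a minimal sketch disabling both registries:

cluster:
  discovery:
    enabled: true
    registries:
      kubernetes:
        disabled: true
      service:
        disabled: true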

As of v0.14, Talos supports the kubernetes and service registries.

Kubernetes registry uses Kubernetes Node resource data and additional Talos annotations:

$ kubectl describe node <nodename>
Annotations:        cluster.talos.dev/node-id: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd
                    networking.talos.dev/assigned-prefixes: 10.244.0.0/32,10.244.0.1/24
                    networking.talos.dev/self-ips: 172.20.0.2,fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94
...

Service registry uses external Discovery Service to exchange encrypted information about cluster members.

Resource Definitions

Talos v0.14 introduces seven new resources that can be used to introspect the new discovery and KubeSpan features.

Discovery

Identities

The node’s unique identity (base62 encoded random 32 bytes) can be obtained with:

Note: Using base62 allows the ID to be URL encoded without having to use the ambiguous URL-encoding version of base64.

$ talosctl get identities -o yaml
...
spec:
    nodeId: Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd

Node identity is used as the unique Affiliate identifier.

Node identity resource is preserved in the STATE partition in node-identity.yaml file. Node identity is preserved across reboots and upgrades, but it is regenerated if the node is reset (wiped).

Affiliates

An affiliate is a proposed member: a node attributed to the cluster because it has the same cluster ID and secret.

$ talosctl get affiliates
ID                                             VERSION   HOSTNAME                 MACHINE TYPE   ADDRESSES
2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF    2         talos-default-master-2   controlplane   ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
6EVq8RHIne03LeZiJ60WsJcoQOtttw1ejvTS6SOBzhUA   2         talos-default-worker-1   worker         ["172.20.0.5","fd83:b1f7:fcb5:2802:cc80:3dff:fece:d89d"]
NVtfu1bT1QjhNq5xJFUZl8f8I8LOCnnpGrZfPpdN9WlB   2         talos-default-worker-2   worker         ["172.20.0.6","fd83:b1f7:fcb5:2802:2805:fbff:fe80:5ed2"]
Utoh3O0ZneV0kT2IUBrh7TgdouRcUW2yzaaMl4VXnCd    4         talos-default-master-1   controlplane   ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]
b3DebkPaCRLTLLWaeRF1ejGaR0lK3m79jRJcPn0mfA6C   2         talos-default-master-3   controlplane   ["172.20.0.4","fd83:b1f7:fcb5:2802:248f:1fff:fe5c:c3f"]

One of the Affiliates with the ID matching node identity is populated from the node data, other Affiliates are pulled from the registries. Enabled discovery registries run in parallel and discovered data is merged to build the list presented above.

Details about data coming from each registry can be queried from the cluster-raw namespace:

$ talosctl get affiliates --namespace=cluster-raw
ID                                                     VERSION   HOSTNAME                 MACHINE TYPE   ADDRESSES
k8s/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF        3         talos-default-master-2   controlplane   ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
k8s/6EVq8RHIne03LeZiJ60WsJcoQOtttw1ejvTS6SOBzhUA       2         talos-default-worker-1   worker         ["172.20.0.5","fd83:b1f7:fcb5:2802:cc80:3dff:fece:d89d"]
k8s/NVtfu1bT1QjhNq5xJFUZl8f8I8LOCnnpGrZfPpdN9WlB       2         talos-default-worker-2   worker         ["172.20.0.6","fd83:b1f7:fcb5:2802:2805:fbff:fe80:5ed2"]
k8s/b3DebkPaCRLTLLWaeRF1ejGaR0lK3m79jRJcPn0mfA6C       3         talos-default-master-3   controlplane   ["172.20.0.4","fd83:b1f7:fcb5:2802:248f:1fff:fe5c:c3f"]
service/2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF    23        talos-default-master-2   controlplane   ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
service/6EVq8RHIne03LeZiJ60WsJcoQOtttw1ejvTS6SOBzhUA   26        talos-default-worker-1   worker         ["172.20.0.5","fd83:b1f7:fcb5:2802:cc80:3dff:fece:d89d"]
service/NVtfu1bT1QjhNq5xJFUZl8f8I8LOCnnpGrZfPpdN9WlB   20        talos-default-worker-2   worker         ["172.20.0.6","fd83:b1f7:fcb5:2802:2805:fbff:fe80:5ed2"]
service/b3DebkPaCRLTLLWaeRF1ejGaR0lK3m79jRJcPn0mfA6C   14        talos-default-master-3   controlplane   ["172.20.0.4","fd83:b1f7:fcb5:2802:248f:1fff:fe5c:c3f"]

Each Affiliate ID is prefixed with k8s/ for data coming from the Kubernetes registry and with service/ for data coming from the discovery service.

Members

A member is an affiliate that has been approved to join the cluster. The members of the cluster can be obtained with:

$ talosctl get members
ID                       VERSION   HOSTNAME                 MACHINE TYPE   OS                ADDRESSES
talos-default-master-1   2         talos-default-master-1   controlplane   Talos (v0.14.0)   ["172.20.0.2","fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94"]
talos-default-master-2   1         talos-default-master-2   controlplane   Talos (v0.14.0)   ["172.20.0.3","fd83:b1f7:fcb5:2802:986b:7eff:fec5:889d"]
talos-default-master-3   1         talos-default-master-3   controlplane   Talos (v0.14.0)   ["172.20.0.4","fd83:b1f7:fcb5:2802:248f:1fff:fe5c:c3f"]
talos-default-worker-1   1         talos-default-worker-1   worker         Talos (v0.14.0)   ["172.20.0.5","fd83:b1f7:fcb5:2802:cc80:3dff:fece:d89d"]
talos-default-worker-2   1         talos-default-worker-2   worker         Talos (v0.14.0)   ["172.20.0.6","fd83:b1f7:fcb5:2802:2805:fbff:fe80:5ed2"]

7.18 - Disk Encryption

Guide on using system disk encryption

It is possible to enable encryption for system disks at the OS level. As of this writing, only the STATE and EPHEMERAL partitions can be encrypted. STATE contains the most sensitive node data: secrets and certs. The EPHEMERAL partition may contain some sensitive workload data. Data is encrypted using LUKS2, which is provided by the Linux kernel modules and the cryptsetup utility. The operating system runs additional setup steps when encryption is enabled.

If the disk encryption is enabled for the STATE partition, the system will:

  • Save STATE encryption config as JSON in the META partition.
  • Before mounting the STATE partition, load encryption configs either from the machine config or from the META partition. Note that the machine config is always preferred over the META one.
  • Before mounting the STATE partition, format and encrypt it. This occurs only if the STATE partition is empty and has no filesystem.

If the disk encryption is enabled for the EPHEMERAL partition, the system will:

  • Get the encryption config from the machine config.
  • Before mounting the EPHEMERAL partition, encrypt and format it. This occurs only if the EPHEMERAL partition is empty and has no filesystem.

Configuration

Right now this encryption is disabled by default. To enable disk encryption you should modify the machine configuration with the following options:

machine:
  ...
  systemDiskEncryption:
    ephemeral:
      keys:
        - nodeID: {}
          slot: 0
    state:
      keys:
        - nodeID: {}
          slot: 0

Encryption Keys

Note: What the LUKS2 docs call “keys” are, in reality, a passphrase. When this passphrase is added, LUKS2 runs argon2 to create an actual key from that passphrase.

LUKS2 supports up to 32 encryption keys, and it is possible to specify all of them in the machine configuration. Talos always tries to sync the list of keys defined in the machine config with the actual keys defined for the LUKS2 partition. So, when updating the list of keys, keep at least one key unchanged so it can be used for key management.

When you define a key you should specify the key kind and the slot:

machine:
  ...
  state:
    keys:
      - nodeID: {} # key kind
        slot: 1

  ephemeral:
    keys:
      - static:
          passphrase: supersecret
        slot: 0

Note that key order does not determine which key slot is used; every key must always have a slot defined.

Encryption Key Kinds

Talos supports two kinds of keys:

  • nodeID which is generated using the node UUID and the partition label (note that if the node UUID is not really random it will fail the entropy check).
  • static which you define right in the configuration.

Note: Use static keys only for the EPHEMERAL partition, and only if your STATE partition is encrypted. A static key for the STATE partition would be stored in the META partition, which is not encrypted.

Key Rotation

Rotating keys requires running talosctl apply-config more than once, because at least one working key must remain unchanged while the other keys around it are replaced.

So, for example, first add a new key:

machine:
  ...
  ephemeral:
    keys:
      - static:
          passphrase: oldkey
        slot: 0
      - static:
          passphrase: newkey
        slot: 1
  ...

Run:

talosctl apply-config -n <node> -f config.yaml

Then remove the old key:

machine:
  ...
  ephemeral:
    keys:
      - static:
          passphrase: newkey
        slot: 1
  ...

Run:

talosctl apply-config -n <node> -f config.yaml

Going from Unencrypted to Encrypted and Vice Versa

Ephemeral Partition

There is no in-place encryption support for the partitions right now, so to avoid losing any data only empty partitions can be encrypted.

As such, migration from unencrypted to encrypted needs some additional handling, especially around explicitly wiping partitions.

  • apply-config should be called with --on-reboot flag.
  • Partition should be wiped after apply-config, but before the reboot.

Edit your machine config and add the encryption configuration:

vim config.yaml

Apply the configuration with --on-reboot flag:

talosctl apply-config -f config.yaml -n <node ip> --on-reboot

Wipe the partition you’re going to encrypt:

talosctl reset --system-labels-to-wipe EPHEMERAL -n <node ip> --reboot=true

That’s it! After you run the last command, the partition will be wiped and the node will reboot. During the next boot the system will encrypt the partition.

State Partition

Calling wipe against the STATE partition will make the node lose the config, so the previous flow is not going to work.

The flow should be to first wipe the STATE partition:

talosctl reset  --system-labels-to-wipe STATE -n <node ip> --reboot=true

The node will enter maintenance mode; then run apply-config with the --insecure flag:

talosctl apply-config --insecure -n <node ip> -f config.yaml

After installation is complete the node should encrypt the STATE partition.

7.19 - Editing Machine Configuration

How to edit and patch Talos machine configuration, with reboot, immediately, or stage update on reboot.

Talos node state is fully defined by machine configuration. Initial configuration is delivered to the node at bootstrap time, but configuration can be updated while the node is running.

Note: Be sure that the config is persisted so that configuration updates are not overwritten on reboots. Configuration persistence has been enabled by default since Talos 0.5 (persist: true in machine configuration).

There are three talosctl commands which facilitate machine configuration updates:

  • talosctl apply-config to apply configuration from the file
  • talosctl edit machineconfig to launch an editor with existing node configuration, make changes and apply configuration back
  • talosctl patch machineconfig to apply automated machine configuration via JSON patch

Each of these commands can operate in one of three modes:

  • apply change with a reboot (default): update configuration, reboot Talos node to apply configuration change
  • apply change immediately (--immediate flag): change is applied immediately without a reboot, only .cluster sub-tree of the machine configuration can be updated in Talos 0.9
  • apply change on next reboot (--on-reboot): change is staged to be applied after a reboot, but node is not rebooted

Note: applying change on next reboot (--on-reboot) doesn’t modify current node configuration, so next call to talosctl edit machineconfig --on-reboot will not see changes

talosctl apply-config

This command is mostly used to submit initial machine configuration to the node (generated by talosctl gen config). It can be used to apply new configuration from the file to the running node as well, but most of the time it’s not convenient, as it doesn’t operate on the current node machine configuration.

Example:

talosctl -n <IP> apply-config -f config.yaml

Command apply-config can also be invoked as apply machineconfig:

talosctl -n <IP> apply machineconfig -f config.yaml

Applying machine configuration immediately (without a reboot):

talosctl -n IP apply machineconfig -f config.yaml --immediate

talosctl edit machineconfig

Command talosctl edit loads current machine configuration from the node and launches configured editor to modify the config. If config hasn’t been changed in the editor (or if updated config is empty), update is not applied.

Note: Talos uses environment variables TALOS_EDITOR, EDITOR to pick up the editor preference. If environment variables are missing, vi editor is used by default.

Example:

talosctl -n <IP> edit machineconfig

Configuration can be edited for multiple nodes if multiple IP addresses are specified:

talosctl -n <IP1>,<IP2>,... edit machineconfig

Applying machine configuration change immediately (without a reboot):

talosctl -n <IP> edit machineconfig --immediate

talosctl patch machineconfig

The talosctl patch command works similarly to talosctl edit: it loads the current machine configuration, but instead of launching the configured editor it applies a JSON patch to the configuration and writes the result back to the node.

Example, updating kubelet version (with a reboot):

$ talosctl -n <IP> patch machineconfig -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/talos-systems/kubelet:v1.20.5"}]'
patched mc at the node <IP>

Updating kube-apiserver version in immediate mode (without a reboot):

$ talosctl -n <IP> patch machineconfig --immediate -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "k8s.gcr.io/kube-apiserver:v1.20.5"}]'
patched mc at the node <IP>

Patch might be applied to multiple nodes when multiple IPs are specified:

talosctl -n <IP1>,<IP2>,... patch machineconfig --immediate -p '[{...}]'

Recovering from Node Boot Failures

If a Talos node fails to boot because of a wrong configuration (for example, the control plane endpoint is incorrect), the configuration can be updated to fix the issue. If the boot sequence is still running, Talos might refuse to apply the config in default mode. In that case the --on-reboot mode can be used, coupled with the talosctl reboot command, to trigger a reboot and apply the configuration update.
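
For example, a sketch of that flow (the node address and file name are placeholders):

talosctl -n <IP> apply-config -f config.yaml --on-reboot
talosctl -n <IP> reboot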

7.20 - KubeSpan

KubeSpan is a feature of Talos that automates the setup and maintenance of a full mesh WireGuard network for your cluster, giving you the ability to operate hybrid Kubernetes clusters that can span the edge, datacenter, and cloud. Management of keys and discovery of peers can be completely automated for a zero-touch experience that makes it simple and easy to create hybrid clusters.

Video Walkthrough

To learn more about KubeSpan, see the video below:

To see a live demo of KubeSpan, see one of the videos below:

Enabling

Creating a New Cluster

To generate configuration files for a new cluster, we can use the --with-kubespan flag in talosctl gen config. This will enable peer discovery and KubeSpan.
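
For example (the cluster name and endpoint below are illustrative):

talosctl gen config my-cluster https://mycluster.local:6443 --with-kubespan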

...
    # Provides machine specific network configuration options.
    network:
        # Configures KubeSpan feature.
        kubespan:
            enabled: true # Enable the KubeSpan feature.
...
    # Configures cluster member discovery.
    discovery:
        enabled: true # Enable the cluster membership discovery feature.
        # Configure registries used for cluster member discovery.
        registries:
            # Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information
            kubernetes: {}
            # Service registry is using an external service to push and pull information about cluster members.
            service: {}
...
# Provides cluster specific configuration options.
cluster:
    id: yui150Ogam0pdQoNZS2lZR-ihi8EWxNM17bZPktJKKE= # Globally unique identifier for this cluster.
    secret: dAmFcyNmDXusqnTSkPJrsgLJ38W8oEEXGZKM0x6Orpc= # Shared secret of cluster.

The default discovery service is an external service hosted for free by Sidero Labs. The default value is https://discovery.talos.dev/. Contact Sidero Labs if you need to run this service privately.

Upgrading an Existing Cluster

In order to enable KubeSpan for an existing cluster, upgrade to the latest v0.14. Once your cluster is upgraded, the configuration of each node must contain the globally unique cluster identifier and the shared cluster secret, and must have KubeSpan and discovery enabled.

Note: Discovery can be used without KubeSpan, but KubeSpan requires at least one discovery registry.

Talos v0.11 or Less

If you are migrating from Talos v0.11 or less, we need to generate a cluster ID and secret.

To generate an id:

$ openssl rand -base64 32
EUsCYz+oHNuBppS51P9aKSIOyYvIPmbZK944PWgiyMQ=

To generate a secret:

$ openssl rand -base64 32
AbdsWjY9i797kGglghKvtGdxCsdllX9CemLq+WGVeaw=

Now, update the configuration of each node in the cluster with the generated id and secret. You should end up with the addition of something like this (your id and secret should be different):

cluster:
  id: EUsCYz+oHNuBppS51P9aKSIOyYvIPmbZK944PWgiyMQ=
  secret: AbdsWjY9i797kGglghKvtGdxCsdllX9CemLq+WGVeaw=

Note: This can be applied in immediate mode (no reboot required) by passing --immediate to either the edit machineconfig or apply-config subcommands.

Talos v0.12

Enable kubespan and discovery.

machine:
  network:
    kubespan:
      enabled: true
cluster:
  discovery:
    enabled: true
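
The same change can be applied to a running node as a JSON patch, for example (a sketch; the node address is a placeholder, and the patch assumes the machine.network section already exists in the configuration):

talosctl -n <IP> patch machineconfig --immediate -p '[{"op": "add", "path": "/machine/network/kubespan", "value": {"enabled": true}}, {"op": "add", "path": "/cluster/discovery", "value": {"enabled": true}}]'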

Resource Definitions

KubeSpanIdentities

A node’s WireGuard identities can be obtained with:

$ talosctl get kubespanidentities -o yaml
...
spec:
    address: fd83:b1f7:fcb5:2802:8c13:71ff:feaf:7c94/128
    subnet: fd83:b1f7:fcb5:2802::/64
    privateKey: gNoasoKOJzl+/B+uXhvsBVxv81OcVLrlcmQ5jQwZO08=
    publicKey: NzW8oeIH5rJyY5lefD9WRoHWWRr/Q6DwsDjMX+xKjT4=

Talos automatically configures a unique IPv6 address for each node within the cluster-specific IPv6 ULA prefix.

A WireGuard private key is generated for each node; the private key never leaves the node, while the public key is published through cluster discovery.

KubeSpanIdentity is persisted across reboots and upgrades in the STATE partition in the file kubespan-identity.yaml.

KubeSpanPeerSpecs

A node’s WireGuard peers can be obtained with:

$ talosctl get kubespanpeerspecs
ID                                             VERSION   LABEL                    ENDPOINTS
06D9QQOydzKrOL7oeLiqHy9OWE8KtmJzZII2A5/FLFI=   2         talos-default-master-2   ["172.20.0.3:51820"]
THtfKtfNnzJs1nMQKs5IXqK0DFXmM//0WMY+NnaZrhU=   2         talos-default-master-3   ["172.20.0.4:51820"]
nVHu7l13uZyk0AaI1WuzL2/48iG8af4WRv+LWmAax1M=   2         talos-default-worker-2   ["172.20.0.6:51820"]
zXP0QeqRo+CBgDH1uOBiQ8tA+AKEQP9hWkqmkE/oDlc=   2         talos-default-worker-1   ["172.20.0.5:51820"]

The peer ID is the WireGuard public key. KubeSpanPeerSpecs are built from the cluster discovery data.

KubeSpanPeerStatuses

The status of a node’s WireGuard peers can be obtained with:

$ talosctl get kubespanpeerstatuses
ID                                             VERSION   LABEL                    ENDPOINT           STATE   RX         TX
06D9QQOydzKrOL7oeLiqHy9OWE8KtmJzZII2A5/FLFI=   63        talos-default-master-2   172.20.0.3:51820   up      15043220   17869488
THtfKtfNnzJs1nMQKs5IXqK0DFXmM//0WMY+NnaZrhU=   62        talos-default-master-3   172.20.0.4:51820   up      14573208   18157680
nVHu7l13uZyk0AaI1WuzL2/48iG8af4WRv+LWmAax1M=   60        talos-default-worker-2   172.20.0.6:51820   up      130072     46888
zXP0QeqRo+CBgDH1uOBiQ8tA+AKEQP9hWkqmkE/oDlc=   60        talos-default-worker-1   172.20.0.5:51820   up      130044     46556

KubeSpan peer status includes the following information:

  • the actual endpoint used for peer communication
  • link state:
    • unknown: the endpoint was just changed, link state is not known yet
    • up: there is a recent handshake from the peer
    • down: there is no handshake from the peer
  • number of bytes sent/received over the Wireguard link with the peer

If the connection state goes down, Talos cycles through the available endpoints until it finds one that works.

Peer status information is updated every 30 seconds.

KubeSpanEndpoints

A node’s WireGuard endpoints (peer addresses) can be obtained with:

$ talosctl get kubespanendpoints
ID                                             VERSION   ENDPOINT           AFFILIATE ID
06D9QQOydzKrOL7oeLiqHy9OWE8KtmJzZII2A5/FLFI=   1         172.20.0.3:51820   2VfX3nu67ZtZPl57IdJrU87BMjVWkSBJiL9ulP9TCnF
THtfKtfNnzJs1nMQKs5IXqK0DFXmM//0WMY+NnaZrhU=   1         172.20.0.4:51820   b3DebkPaCRLTLLWaeRF1ejGaR0lK3m79jRJcPn0mfA6C
nVHu7l13uZyk0AaI1WuzL2/48iG8af4WRv+LWmAax1M=   1         172.20.0.6:51820   NVtfu1bT1QjhNq5xJFUZl8f8I8LOCnnpGrZfPpdN9WlB
zXP0QeqRo+CBgDH1uOBiQ8tA+AKEQP9hWkqmkE/oDlc=   1         172.20.0.5:51820   6EVq8RHIne03LeZiJ60WsJcoQOtttw1ejvTS6SOBzhUA

The endpoint ID is the base64 encoded WireGuard public key.

The observed endpoints are submitted back to the discovery service (if enabled) so that other peers can try additional endpoints to establish the connection.

7.21 - Logging

Viewing logs

Kernel messages can be retrieved with talosctl dmesg command:

$ talosctl -n 172.20.1.2 dmesg

172.20.1.2: kern:    info: [2021-11-10T10:09:37.662764956Z]: Command line: init_on_alloc=1 slab_nomerge pti=on consoleblank=0 nvme_core.io_timeout=4294967295 random.trust_cpu=on printk.devkmsg=on ima_template=ima-ng ima_appraise=fix ima_hash=sha512 console=ttyS0 reboot=k panic=1 talos.shutdown=halt talos.platform=metal talos.config=http://172.20.1.1:40101/config.yaml
[...]

Service logs can be retrieved with talosctl logs command:

$ talosctl -n 172.20.1.2 services

NODE         SERVICE      STATE     HEALTH   LAST CHANGE   LAST EVENT
172.20.1.2   apid         Running   OK       19m27s ago    Health check successful
172.20.1.2   containerd   Running   OK       19m29s ago    Health check successful
172.20.1.2   cri          Running   OK       19m27s ago    Health check successful
172.20.1.2   etcd         Running   OK       19m22s ago    Health check successful
172.20.1.2   kubelet      Running   OK       19m20s ago    Health check successful
172.20.1.2   machined     Running   ?        19m30s ago    Service started as goroutine
172.20.1.2   trustd       Running   OK       19m27s ago    Health check successful
172.20.1.2   udevd        Running   OK       19m28s ago    Health check successful

$ talosctl -n 172.20.1.2 logs machined

172.20.1.2: [talos] task setupLogger (1/1): done, 106.109µs
172.20.1.2: [talos] phase logger (1/7): done, 564.476µs
[...]

Container logs for Kubernetes pods can be retrieved with talosctl logs -k command:

$ talosctl -n 172.20.1.2 containers -k
NODE         NAMESPACE   ID                                                 IMAGE                                                         PID    STATUS
172.20.1.2   k8s.io      kube-system/kube-flannel-dk6d5                     k8s.gcr.io/pause:3.5                                          1329   SANDBOX_READY
172.20.1.2   k8s.io      └─ kube-system/kube-flannel-dk6d5:install-cni      ghcr.io/talos-systems/install-cni:v0.7.0-alpha.0-1-g2bb2efc   0      CONTAINER_EXITED
172.20.1.2   k8s.io      └─ kube-system/kube-flannel-dk6d5:install-config   quay.io/coreos/flannel:v0.13.0                                0      CONTAINER_EXITED
172.20.1.2   k8s.io      └─ kube-system/kube-flannel-dk6d5:kube-flannel     quay.io/coreos/flannel:v0.13.0                                1610   CONTAINER_RUNNING
172.20.1.2   k8s.io      kube-system/kube-proxy-gfkqj                       k8s.gcr.io/pause:3.5                                          1311   SANDBOX_READY
172.20.1.2   k8s.io      └─ kube-system/kube-proxy-gfkqj:kube-proxy         k8s.gcr.io/kube-proxy:v1.23.0                                 1379   CONTAINER_RUNNING

$ talosctl -n 172.20.1.2 logs -k kube-system/kube-proxy-gfkqj:kube-proxy
172.20.1.2: 2021-11-30T19:13:20.567825192Z stderr F I1130 19:13:20.567737       1 server_others.go:138] "Detected node IP" address="172.20.0.3"
172.20.1.2: 2021-11-30T19:13:20.599684397Z stderr F I1130 19:13:20.599613       1 server_others.go:206] "Using iptables Proxier"
[...]

Sending logs

Service logs

You can enable sending logs in the machine configuration:

machine:
  logging:
    destinations:
      - endpoint: "udp://127.0.0.1:12345/"
        format: "json_lines"
      - endpoint: "tcp://host:5044/"
        format: "json_lines"

Several destinations can be specified. Supported protocols are UDP and TCP. The only currently supported format is json_lines:

{
  "msg": "[talos] apply config request: immediate true, on reboot false",
  "talos-level": "info",
  "talos-service": "machined",
  "talos-time": "2021-11-10T10:48:49.294858021Z"
}

Messages are newline-separated when sent over TCP. Over UDP, messages are sent with one message per packet. The msg, talos-level, talos-service, and talos-time fields are always present; there may be additional fields.
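
When setting this up, it can be handy to verify that log lines actually arrive before pointing Talos at a real collector; a quick sketch is to listen on the receiving host with netcat (some netcat variants need nc -u -l -p 12345 instead):

nc -ul 12345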

Kernel logs

Kernel log delivery can be enabled with the talos.logging.kernel kernel command line argument, which can be specified in the .machine.installer.extraKernelArgs:

machine:
  install:
    extraKernelArgs:
      - talos.logging.kernel=tcp://host:5044/

Kernel log destination is specified in the same way as service log endpoint. The only supported format is json_lines.

Sample message:

{
  "clock":6252819, // time relative to the kernel boot time
  "facility":"user",
  "msg":"[talos] task startAllServices (1/1): waiting for 6 services\n",
  "priority":"warning",
  "seq":711,
  "talos-level":"warn", // Talos-translated `priority` into common logging level
  "talos-time":"2021-11-26T16:53:21.3258698Z" // Talos-translated `clock` using current time
}

Note that extraKernelArgs in the machine configuration are only applied on Talos upgrades, not by simply applying the config (upgrading to the same version is fine).
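
So, after adding the argument, trigger an upgrade, even to the version that is already installed; a sketch (the installer image and version are placeholders for your current release):

talosctl -n <IP> upgrade --image ghcr.io/talos-systems/installer:v0.14.0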

Filebeat example

One way to forward logs to other log collection services is to send them to a Filebeat instance running in the cluster itself (in the host network), which takes care of forwarding them to other endpoints (and applying the necessary transformations).

If Elastic Cloud on Kubernetes is being used, the following Beat (custom resource) configuration might be helpful:

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: talos
spec:
  type: filebeat
  version: 7.15.1
  elasticsearchRef:
    name: talos
  config:
    filebeat.inputs:
      - type: "udp"
        host: "127.0.0.1:12345"
        processors:
          - decode_json_fields:
              fields: ["message"]
              target: ""
          - timestamp:
              field: "talos-time"
              layouts:
                - "2006-01-02T15:04:05.999999999Z07:00"
          - drop_fields:
              fields: ["message", "talos-time"]
          - rename:
              fields:
                - from: "msg"
                  to: "message"

  daemonSet:
    updateStrategy:
      rollingUpdate:
        maxUnavailable: 100%
    podTemplate:
      spec:
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true
        securityContext:
          runAsUser: 0
        containers:
          - name: filebeat
            ports:
              - protocol: UDP
                containerPort: 12345
                hostPort: 12345

The input configuration ensures that messages and timestamps are extracted properly. Refer to the Filebeat documentation on how to forward logs to other outputs.

Also note the hostNetwork: true in the daemonSet configuration.

This ensures filebeat uses the host network, and listens on 127.0.0.1:12345 (UDP) on every machine, which can then be specified as a logging endpoint in the machine configuration.

7.22 - Managing PKI

Generating an Administrator Key Pair

In order to create a key pair, you will need the root CA.

Save the CA public key, and CA private key as ca.crt, and ca.key respectively. Now, run the following commands to generate a certificate:

talosctl gen key --name admin
talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin

Now, base64 encode admin.crt, and admin.key:

cat admin.crt | base64
cat admin.key | base64

You can now set the crt and key fields in the talosconfig to the base64 encoded strings.
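
For reference, the fields live under the context entry in talosconfig; a sketch with placeholder values:

context: my-cluster
contexts:
  my-cluster:
    endpoints:
      - <IP>
    ca: <base64 encoded ca.crt>
    crt: <base64 encoded admin.crt>
    key: <base64 encoded admin.key>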

Renewing an Expired Administrator Certificate

In order to renew the certificate, you will need the root CA and the admin private key. The base64 encoded key can be found in the configuration file of any one of the control plane nodes. Where it is exactly will depend on the specific version of the configuration file you are using.

Save the CA public key, CA private key, and admin private key as ca.crt, ca.key, and admin.key respectively. Now, run the following commands to generate a certificate:

talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin

You should see admin.crt in your current directory. Now, base64 encode admin.crt:

cat admin.crt | base64

You can now set the certificate in the talosconfig to the base64 encoded string.

7.23 - Resetting a Machine

From time to time, it may be beneficial to reset a Talos machine to its “original” state. Bear in mind that this is a destructive action for the given machine. Doing this removes the machine from Kubernetes and etcd (if applicable), and clears any data on the machine that would normally persist across a reboot.

WARNING: Running talosctl reset on cloud VMs might result in the VM being unable to boot, as this wipes the entire disk. If not booting via iPXE, it might be more useful to just wipe the STATE and EPHEMERAL partitions on a cloud VM: talosctl reset --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL

The API command for doing this is talosctl reset. There are a couple of flags as part of this command:

Flags:
      --graceful                        if true, attempt to cordon/drain node and leave etcd (if applicable) (default true)
      --reboot                          if true, reboot the node after resetting instead of shutting down
      --system-labels-to-wipe strings   if set, just wipe selected system disk partitions by label but keep other partitions intact

The graceful flag is especially important when considering HA vs. non-HA Talos clusters. If the machine is part of an HA cluster, a normal, graceful reset should work just fine right out of the box as long as the cluster is in a good state. However, if this is a single node cluster being used for testing purposes, a graceful reset is not an option since Etcd cannot be “left” if there is only a single member. In this case, reset should be used with --graceful=false to skip performing checks that would normally block the reset.
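
For example, to reset a single-node test cluster and have it reboot afterwards:

talosctl -n <node ip> reset --graceful=false --reboot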

7.24 - Role-based access control (RBAC)

Talos v0.11 introduced initial support for role-based access control (RBAC). This guide will explain what that is and how to enable it without losing access to the cluster.

RBAC in Talos

Talos uses certificates to authorize users. The certificate subject’s organization field is used to encode user roles. There is a set of predefined roles that allow access to different API methods:

  • os:admin grants access to all methods;
  • os:reader grants access to “safe” methods (for example, that includes the ability to list files, but does not include the ability to read files content);
  • os:etcd:backup grants access to /machine.MachineService/EtcdSnapshot method.

Roles in the current talosconfig can be checked with the following command:

$ talosctl config info

[...]
Roles:               os:admin
[...]

RBAC is enabled by default in new clusters created with talosctl v0.11+ and disabled otherwise.

Enabling RBAC

First, both the Talos cluster and the talosctl tool should be upgraded. Then the talosctl config new command should be used to generate a new client configuration with the os:admin role. Additional configurations and certificates for different roles can be generated by passing the --roles flag:

talosctl config new --roles=os:reader reader

That command will create a new client configuration file, reader, with a new certificate carrying the os:reader role.
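
The new configuration can then be used by pointing talosctl at it, for example (a sketch; the node address is a placeholder):

talosctl --talosconfig=reader -n <IP> ls /etc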

After that, RBAC should be enabled in the machine configuration:

machine:
  features:
    rbac: true

7.25 - Storage

In Kubernetes, using storage in the right way is well-facilitated by the API.
However, unless you are running in a major public cloud, that API may not be hooked up to anything. This frequently sends users down a rabbit hole of researching all the various options for storage backends for their platform, for Kubernetes, and for their workloads. There are a lot of options out there, and it can be fairly bewildering.

For Talos, we try to limit the options somewhat to make the decision-making easier.

Public Cloud

If you are running on a major public cloud, use their block storage. It is easy and automatic.

Storage Clusters

Talos recommends having separate disks (apart from the Talos install disk) to be used for storage.

Redundancy in storage is usually very important. Scaling capabilities, reliability, speed, maintenance load, and ease of use are all factors you must consider when managing your own storage.

Running a storage cluster can be a very good choice when managing your own storage, and there are two projects we recommend, depending on your situation.

If you need vast amounts of storage composed of more than a dozen or so disks, just use Rook to manage Ceph. Also, if you need both mount-once and mount-many capabilities, Ceph is your answer. Ceph also bundles in an S3-compatible object store. The downside of Ceph is that there are a lot of moving parts.

Please note that most people should never use mount-many semantics. NFS is pervasive because it is old and easy, not because it is a good idea. While it may seem like a convenience at first, there are all manner of locking, performance, change control, and reliability concerns inherent in any mount-many situation, so we strongly recommend you avoid this method.

If your storage needs are small enough to not need Ceph, use Mayastor.

Rook/Ceph

Ceph is the grandfather of open source storage clusters. It is big, has a lot of pieces, and will do just about anything. It scales better than almost any other system out there, open source or proprietary, being able to add and remove storage over time with no downtime, safely and easily. It comes bundled with RadosGW, an S3-compatible object store. It comes with CephFS, an NFS-like clustered filesystem. And of course, it comes with RBD, a block storage system.

With the help of Rook, the vast majority of the complexity of Ceph is hidden away by a very robust operator, allowing you to control almost everything about your Ceph cluster from fairly simple Kubernetes CRDs.

So if Ceph is so great, why not use it for everything?

Ceph can be rather slow for small clusters. It relies heavily on CPUs and massive parallelisation to provide good cluster performance, so if you don’t have many of those dedicated to Ceph, it is not going to be well-optimised for you. Also, if your cluster is small, just running Ceph may eat up a significant amount of the resources you have available.

Troubleshooting Ceph can be difficult if you do not understand its architecture. There are lots of acronyms and the documentation assumes a fair level of knowledge. There are very good tools for inspection and debugging, but this is still frequently seen as a concern.

Mayastor

Mayastor is an OpenEBS project built in Rust utilising the modern NVMe-oF system. (Despite the name, Mayastor does not require you to have NVMe drives.) It is fast and lean but still cluster-oriented and cloud native. Unlike most of the other OpenEBS projects, it is not built on the ancient iSCSI system.

Unlike Ceph, Mayastor is just a block store. It focuses on block storage and does it well. It is much less complicated to set up than Ceph, but you probably wouldn’t want to use it for more than a few dozen disks.

Mayastor is new, maybe too new. If you’re looking for something well-tested and battle-hardened, this is not it. If you’re looking for something lean, future-oriented, and simpler than Ceph, it might be a great choice.

Video Walkthrough

To see a live demo of this section, see the video below:

Prep Nodes

Either during initial cluster creation or on running worker nodes, several machine config values should be edited. This information is gathered from the Mayastor documentation. We need to set the vm.nr_hugepages sysctl and add openebs.io/engine=mayastor labels to the nodes which are meant to be storage nodes. This can be done with talosctl patch machineconfig or via config patches during talosctl gen config.

Some examples are shown below, modify as needed.

Using gen config

talosctl gen config my-cluster https://mycluster.local:6443 --config-patch '[{"op": "add", "path": "/machine/sysctls", "value": {"vm.nr_hugepages": "1024"}}, {"op": "add", "path": "/machine/kubelet/extraArgs", "value": {"node-labels": "openebs.io/engine=mayastor"}}]'

Patching an existing node

talosctl patch --immediate machineconfig -n <node ip> --patch '[{"op": "add", "path": "/machine/sysctls", "value": {"vm.nr_hugepages": "1024"}}, {"op": "add", "path": "/machine/kubelet/extraArgs", "value": {"node-labels": "openebs.io/engine=mayastor"}}]'

Note: If you are adding/updating vm.nr_hugepages on a node which already had the openebs.io/engine=mayastor label set, you need to restart kubelet so that it picks up the new value, by issuing the following command:

talosctl -n <node ip> service kubelet restart

Deploy Mayastor

Continue setting up Mayastor using the official documentation.

NFS

NFS is an old pack animal long past its prime. However, it is supported by a wide variety of systems. You don’t want to use it unless you have to, but unfortunately, that “have to” is too frequent.

NFS is slow, has all kinds of bottlenecks involving contention, distributed locking, single points of service, and more.

The NFS client is part of the kubelet image maintained by the Talos team. This means that the version installed in your running kubelet is the version of NFS supported by Talos. You can reduce some of the contention problems by parceling Persistent Volumes from separate underlying directories.

Object storage

Ceph comes with an S3-compatible object store, but there are other options, as well. These can often be built on top of other storage backends. For instance, you may have your block storage running with Mayastor but assign a Pod a large Persistent Volume to serve your object store.

One of the most popular open source add-on object stores is MinIO.

Others (iSCSI)

The most common remaining systems involve iSCSI in one form or another. This includes things like the original OpenEBS, Rancher’s Longhorn, and many proprietary systems. Unfortunately, Talos does not support iSCSI-based systems. iSCSI in Linux is facilitated by open-iscsi. This system was designed long before containers caught on, and it is not well suited to the task, especially when coupled with a read-only host operating system.

One day, we hope to work out a solution for facilitating iSCSI-based systems, but this is not yet available.

7.26 - Troubleshooting Control Plane

Troubleshoot control plane failures for running cluster and bootstrap process.

This guide is written as a series of topics with detailed answers for each. It starts with the basics of the control plane and goes into Talos specifics.

In this guide we assume that the Talos client config is available and Talos API access works. The Kubernetes client configuration can be pulled from control plane nodes with talosctl -n <IP> kubeconfig (this command works before Kubernetes is fully booted).

What is a control plane node?

Talos nodes which have .machine.type of init and controlplane are control plane nodes.

The only difference between init and controlplane nodes is that an init node automatically bootstraps a single-node etcd cluster on first boot if the etcd data directory is empty. A node of type init can be replaced with a controlplane node which is triggered to run the etcd bootstrap with the talosctl --nodes <IP> bootstrap command.

Use of init type nodes is discouraged, as it might lead to a split-brain scenario if one node in an existing cluster is reinstalled while its config type is still init.

It is critical to make sure only one control plane node runs in bootstrap mode (either with node type init or via the bootstrap API/talosctl bootstrap), as having more than one node in bootstrap mode leads to a split-brain scenario (multiple etcd clusters are built instead of a single cluster).

What is special about control plane node?

Control plane nodes in Talos run etcd which provides data store for Kubernetes and Kubernetes control plane components (kube-apiserver, kube-controller-manager and kube-scheduler).

Control plane nodes are tainted by default to prevent workloads from being scheduled to control plane nodes.

How many control plane nodes should be deployed?

With a single control plane node, the cluster is not HA: if that single node experiences hardware failure, the cluster control plane is broken and can’t be recovered. Single control plane node clusters are still used as test clusters and in edge deployments, but it should be noted that this setup is not HA.

The number of control plane nodes should be odd (1, 3, 5, …), as with an even number of nodes etcd quorum doesn’t tolerate failures well: e.g. with 2 control plane nodes the quorum is 2, so failure of either node breaks quorum, making this setup almost equivalent to a single control plane node cluster.

With three control plane nodes cluster can tolerate a failure of any single control plane node. With five control plane nodes cluster can tolerate failure of any two control plane nodes.

What is control plane endpoint?

Kubernetes requires a control plane endpoint which points to any healthy API server running on a control plane node. The control plane endpoint is specified as a URL like https://endpoint:6443/. At any point in time, even during failures, the control plane endpoint should point to a healthy API server instance. As kube-apiserver runs with host networking, the control plane endpoint should point to one of the control plane node IPs: node1:6443, node2:6443, …

For single control plane node clusters, the control plane endpoint might be https://IP:6443/ or https://DNS:6443/, where IP is the IP of the control plane node and DNS points to IP. The DNS form of the endpoint allows the IP address of the control plane node to change over time.

For HA clusters, control plane can be implemented as:

  • TCP L7 loadbalancer with active health checks against port 6443
  • round-robin DNS with active health checks against port 6443
  • BGP anycast IP with health checks
  • virtual shared L2 IP

It is critical that the control plane endpoint works correctly during the cluster bootstrap phase, as nodes discover each other using the control plane endpoint.

kubelet is not running on control plane node

The kubelet service should be running on a control plane node as soon as networking is configured:

$ talosctl -n <IP> service kubelet
NODE     172.20.0.2
ID       kubelet
STATE    Running
HEALTH   OK
EVENTS   [Running]: Health check successful (2m54s ago)
         [Running]: Health check failed: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused (3m4s ago)
         [Running]: Started task kubelet (PID 2334) for container kubelet (3m6s ago)
         [Preparing]: Creating service runner (3m6s ago)
         [Preparing]: Running pre state (3m15s ago)
         [Waiting]: Waiting for service "timed" to be "up" (3m15s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "timed" to be "up" (3m16s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m18s ago)

If kubelet is not running, it might be caused by a wrong configuration; check the kubelet logs with talosctl logs:

$ talosctl -n <IP> logs kubelet
172.20.0.2: I0305 20:45:07.756948    2334 controller.go:101] kubelet config controller: starting controller
172.20.0.2: I0305 20:45:07.756995    2334 controller.go:267] kubelet config controller: ensuring filesystem is set up correctly
172.20.0.2: I0305 20:45:07.757000    2334 fsstore.go:59] kubelet config controller: initializing config checkpoints directory "/etc/kubernetes/kubelet/store"

etcd is not running on bootstrap node

etcd should be running on the bootstrap node immediately (the bootstrap node is either the init node or a controlplane node after the talosctl bootstrap command was issued). When the node boots for the first time, the etcd data directory /var/lib/etcd is empty, and Talos launches etcd in a mode to build the initial single-node cluster. At this time the /var/lib/etcd directory becomes non-empty and etcd runs as usual.

If etcd is not running, check service etcd state:

$ talosctl -n <IP> service etcd
NODE     172.20.0.2
ID       etcd
STATE    Running
HEALTH   OK
EVENTS   [Running]: Health check successful (3m21s ago)
         [Running]: Started task etcd (PID 2343) for container etcd (3m26s ago)
         [Preparing]: Creating service runner (3m26s ago)
         [Preparing]: Running pre state (3m26s ago)
         [Waiting]: Waiting for service "cri" to be "up", service "networkd" to be "up", service "timed" to be "up" (3m26s ago)

If the service is stuck in the Preparing state on the bootstrap node, it might be related to a slow network: at this stage Talos pulls the etcd image from the container registry.

If the etcd service is crashing and restarting, check the service logs with talosctl -n <IP> logs etcd. The most common reasons for crashes are:

  • wrong arguments passed via extraArgs in the configuration;
  • booting Talos on a non-empty disk with a previous Talos installation: /var/lib/etcd contains data from the old cluster.

etcd is not running on non-bootstrap control plane node

The etcd service on a non-bootstrap control plane node waits for Kubernetes to boot successfully on the bootstrap node in order to find other peers to build a cluster. As soon as the bootstrap node boots the Kubernetes control plane components, and kubectl get endpoints returns the IP of the bootstrap control plane node, the other control plane nodes start joining the etcd cluster, followed by the Kubernetes control plane components starting on each control plane node.

Kubernetes static pod definitions are not generated

Talos should write down static pod definitions for the Kubernetes control plane:

$ talosctl -n <IP> ls /etc/kubernetes/manifests
NODE         NAME
172.20.0.2   .
172.20.0.2   talos-kube-apiserver.yaml
172.20.0.2   talos-kube-controller-manager.yaml
172.20.0.2   talos-kube-scheduler.yaml

If static pod definitions are not rendered, check etcd and kubelet service health (see above), and controller runtime logs (talosctl logs controller-runtime).
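
For example:

talosctl -n <IP> logs controller-runtime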

Talos prints error an error on the server ("") has prevented the request from succeeding

This is expected during initial cluster bootstrap and sometimes after a reboot:

[   70.093289] [talos] task labelNodeAsMaster (1/1): starting
[   80.094038] [talos] retrying error: an error on the server ("") has prevented the request from succeeding (get nodes talos-default-master-1)

Initially the kube-apiserver component is not running yet, and it takes some time before it becomes fully up during bootstrap (the image has to be pulled from the Internet, etc.). Once the control plane endpoint is up, Talos should proceed.

If Talos doesn’t proceed further, it might be a configuration issue.

In any case, status of control plane components can be checked with talosctl containers -k:

$ talosctl -n <IP> containers --kubernetes
NODE         NAMESPACE   ID                                                                                      IMAGE                                        PID    STATUS
172.20.0.2   k8s.io      kube-system/kube-apiserver-talos-default-master-1                                       k8s.gcr.io/pause:3.2                         2539   SANDBOX_READY
172.20.0.2   k8s.io      └─ kube-system/kube-apiserver-talos-default-master-1:kube-apiserver                     k8s.gcr.io/kube-apiserver:v1.20.4            2572   CONTAINER_RUNNING

If kube-apiserver shows as CONTAINER_EXITED, it might have exited due to a configuration error. Logs can be checked with talosctl logs --kubernetes (or with -k as a shorthand):

$ talosctl -n <IP> logs -k kube-system/kube-apiserver-talos-default-master-1:kube-apiserver
172.20.0.2: 2021-03-05T20:46:13.133902064Z stderr F 2021/03/05 20:46:13 Running command:
172.20.0.2: 2021-03-05T20:46:13.133933824Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.2: 2021-03-05T20:46:13.133938524Z stderr F Run from directory:
172.20.0.2: 2021-03-05T20:46:13.13394154Z stderr F Executable path: /usr/local/bin/kube-apiserver
...

Talos prints error nodes "talos-default-master-1" not found

This error means that kube-apiserver is up and the control plane endpoint is healthy, but the kubelet hasn’t received its client certificate yet and wasn’t able to register itself.

For the kubelet to get its client certificate, the following conditions should apply:

  • control plane endpoint is healthy (kube-apiserver is running)
  • bootstrap manifests got successfully deployed (for CSR auto-approval)
  • kube-controller-manager is running

CSR state can be checked with kubectl get csr:

$ kubectl get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                 CONDITION
csr-jcn9j   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
csr-p6b9q   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
csr-sw6rm   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
csr-vlghg   14m   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:q9pyzr   Approved,Issued
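
If CSRs are stuck in the Pending state, they can be approved manually, using a CSR name from the output above (a sketch; assumes the certificate requests are legitimate):

# approve a pending kubelet client certificate request by name
kubectl certificate approve csr-jcn9j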

Talos prints error node not ready

A node in Kubernetes is marked as Ready once the CNI is up. It takes a minute or two for the CNI images to be pulled and for the CNI to start. If the node is stuck in this state for too long, check the CNI pods and logs with kubectl; usually the CNI resources are created in the kube-system namespace. For example, for the Talos default Flannel CNI:

$ kubectl -n kube-system get pods
NAME                                             READY   STATUS    RESTARTS   AGE
...
kube-flannel-25drx                               1/1     Running   0          23m
kube-flannel-8lmb6                               1/1     Running   0          23m
kube-flannel-gl7nx                               1/1     Running   0          23m
kube-flannel-jknt9                               1/1     Running   0          23m
...
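
Logs of a specific CNI pod can then be checked with kubectl, using a pod name from the output above, for example:

# inspect the logs of one of the flannel pods
kubectl -n kube-system logs kube-flannel-25drx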

Talos prints error x509: certificate signed by unknown authority

The full error might look like:

x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")

Commonly, the control plane endpoint points to a different cluster, so the client certificate generated by Talos doesn’t match the CA of the cluster at the control plane endpoint.
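
One way to recover is to fetch a fresh kubeconfig directly from the node, overwriting the stale one (a sketch; assumes talosctl is configured for this cluster):

# regenerate the local kubeconfig from the control plane node
talosctl -n <IP> kubeconfig --force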

etcd is running on bootstrap node, but stuck in pre state on non-bootstrap nodes

Please see the question etcd is not running on non-bootstrap control plane node.

Checking kube-controller-manager and kube-scheduler

If the control plane endpoint is up, the status of the pods can be checked with kubectl:

$ kubectl get pods -n kube-system -l k8s-app=kube-controller-manager
NAME                                             READY   STATUS    RESTARTS   AGE
kube-controller-manager-talos-default-master-1   1/1     Running   0          28m
kube-controller-manager-talos-default-master-2   1/1     Running   0          28m
kube-controller-manager-talos-default-master-3   1/1     Running   0          28m

If the control plane endpoint is not up yet, the container status can be queried with talosctl containers --kubernetes:

$ talosctl -n <IP> c -k
NODE         NAMESPACE   ID                                                                                      IMAGE                                        PID    STATUS
...
172.20.0.2   k8s.io      kube-system/kube-controller-manager-talos-default-master-1                              k8s.gcr.io/pause:3.2                         2547   SANDBOX_READY
172.20.0.2   k8s.io      └─ kube-system/kube-controller-manager-talos-default-master-1:kube-controller-manager   k8s.gcr.io/kube-controller-manager:v1.20.4   2580   CONTAINER_RUNNING
172.20.0.2   k8s.io      kube-system/kube-scheduler-talos-default-master-1                                       k8s.gcr.io/pause:3.2                         2638   SANDBOX_READY
172.20.0.2   k8s.io      └─ kube-system/kube-scheduler-talos-default-master-1:kube-scheduler                     k8s.gcr.io/kube-scheduler:v1.20.4            2670   CONTAINER_RUNNING
...

If some of the containers are not running, it could be that the image is still being pulled. Otherwise the process might be crashing; in that case, logs can be checked with talosctl logs --kubernetes <containerID>:

$ talosctl -n <IP> logs -k kube-system/kube-controller-manager-talos-default-master-1:kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291667526Z stderr F 2021/03/09 13:59:34 Running command:
172.20.0.3: 2021-03-09T13:59:34.291702262Z stderr F Command env: (log-file=, also-stdout=false, redirect-stderr=true)
172.20.0.3: 2021-03-09T13:59:34.291707121Z stderr F Run from directory:
172.20.0.3: 2021-03-09T13:59:34.291710908Z stderr F Executable path: /usr/local/bin/kube-controller-manager
172.20.0.3: 2021-03-09T13:59:34.291719163Z stderr F Args (comma-delimited): /usr/local/bin/kube-controller-manager,--allocate-node-cidrs=true,--cloud-provider=,--cluster-cidr=10.244.0.0/16,--service-cluster-ip-range=10.96.0.0/12,--cluster-signing-cert-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--cluster-signing-key-file=/system/secrets/kubernetes/kube-controller-manager/ca.key,--configure-cloud-routes=false,--kubeconfig=/system/secrets/kubernetes/kube-controller-manager/kubeconfig,--leader-elect=true,--root-ca-file=/system/secrets/kubernetes/kube-controller-manager/ca.crt,--service-account-private-key-file=/system/secrets/kubernetes/kube-controller-manager/service-account.key,--profiling=false
172.20.0.3: 2021-03-09T13:59:34.293870359Z stderr F 2021/03/09 13:59:34 Now listening for interrupts
172.20.0.3: 2021-03-09T13:59:34.761113762Z stdout F I0309 13:59:34.760982      10 serving.go:331] Generated self-signed cert in-memory
...

Checking controller runtime logs

Talos runs a set of controllers which work on resources to build and support the Kubernetes control plane.

Some debugging information can be queried from the controller logs with talosctl logs controller-runtime:

$ talosctl -n <IP> logs controller-runtime
172.20.0.2: 2021/03/09 13:57:11  secrets.EtcdController: controller starting
172.20.0.2: 2021/03/09 13:57:11  config.MachineTypeController: controller starting
172.20.0.2: 2021/03/09 13:57:11  k8s.ManifestApplyController: controller starting
172.20.0.2: 2021/03/09 13:57:11  v1alpha1.BootstrapStatusController: controller starting
172.20.0.2: 2021/03/09 13:57:11  v1alpha1.TimeStatusController: controller starting
...

Controllers run a reconcile loop, so they might be starting, failing, and restarting; that is expected behavior. Things to look for:

v1alpha1.BootstrapStatusController: bootkube initialized status not found: control plane is not self-hosted, running with static pods.

k8s.KubeletStaticPodController: writing static pod "/etc/kubernetes/manifests/talos-kube-apiserver.yaml": static pod definitions were rendered successfully.

k8s.ManifestApplyController: controller failed: error creating mapping for object /v1/Secret/bootstrap-token-q9pyzr: an error on the server ("") has prevented the request from succeeding: control plane endpoint is not up yet, bootstrap manifests can’t be injected, controller is going to retry.

k8s.KubeletStaticPodController: controller failed: error refreshing pod status: error fetching pod status: an error on the server ("Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)") has prevented the request from succeeding: kubelet hasn’t been able to contact kube-apiserver yet to push pod status, controller is going to retry.

k8s.ManifestApplyController: created rbac.authorization.k8s.io/v1/ClusterRole/psp:privileged: one of the bootstrap manifests got successfully applied.

secrets.KubernetesController: controller failed: missing cluster.aggregatorCA secret: Talos is running with 0.8 configuration, if the cluster was upgraded from 0.8, this is expected, and conversion process will fix machine config automatically. If this cluster was bootstrapped with version 0.9, machine configuration should be regenerated with 0.9 talosctl.

If there are no new messages in controller-runtime log, it means that controllers finished reconciling successfully.

Checking static pod definitions

Talos generates static pod definitions for kube-apiserver, kube-controller-manager, and kube-scheduler components based on machine configuration. These definitions can be checked as resources with talosctl get staticpods:

$ talosctl -n <IP> get staticpods -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: StaticPods.kubernetes.talos.dev
    id: kube-apiserver
    version: 2
    phase: running
    finalizers:
        - k8s.StaticPodStatus("kube-apiserver")
spec:
    apiVersion: v1
    kind: Pod
    metadata:
        annotations:
            talos.dev/config-version: "1"
            talos.dev/secrets-version: "1"
        creationTimestamp: null
        labels:
            k8s-app: kube-apiserver
            tier: control-plane
        name: kube-apiserver
        namespace: kube-system
...

The status of the static pods can be queried with talosctl get staticpodstatus:

$ talosctl -n <IP> get staticpodstatus
NODE         NAMESPACE      TYPE              ID                                                           VERSION   READY
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-apiserver-talos-default-master-1            1         True
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-controller-manager-talos-default-master-1   1         True
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-scheduler-talos-default-master-1            1         True

The most important status is Ready, printed as the last column; the complete status can be fetched by adding the -o yaml flag.

Checking bootstrap manifests

As part of the bootstrap process, Talos injects bootstrap manifests into the Kubernetes API server. There are two kinds of manifests: system manifests built into Talos and extra manifests that are downloaded (custom CNI, extra manifests in the machine config):

$ talosctl -n <IP> get manifests
NODE         NAMESPACE      TYPE       ID                               VERSION
172.20.0.2   controlplane   Manifest   00-kubelet-bootstrapping-token   1
172.20.0.2   controlplane   Manifest   01-csr-approver-role-binding     1
172.20.0.2   controlplane   Manifest   01-csr-node-bootstrap            1
172.20.0.2   controlplane   Manifest   01-csr-renewal-role-binding      1
172.20.0.2   controlplane   Manifest   02-kube-system-sa-role-binding   1
172.20.0.2   controlplane   Manifest   03-default-pod-security-policy   1
172.20.0.2   controlplane   Manifest   05-https://docs.projectcalico.org/manifests/calico.yaml   1
172.20.0.2   controlplane   Manifest   10-kube-proxy                    1
172.20.0.2   controlplane   Manifest   11-core-dns                      1
172.20.0.2   controlplane   Manifest   11-core-dns-svc                  1
172.20.0.2   controlplane   Manifest   11-kube-config-in-cluster        1

Details of each manifest can be queried by adding -o yaml:

$ talosctl -n <IP> get manifests 01-csr-approver-role-binding --namespace=controlplane -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: Manifests.kubernetes.talos.dev
    id: 01-csr-approver-role-binding
    version: 1
    phase: running
spec:
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: system-bootstrap-approve-node-client-csr
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:certificates.k8s.io:certificatesigningrequests:nodeclient
      subjects:
        - apiGroup: rbac.authorization.k8s.io
          kind: Group
          name: system:bootstrappers

Worker node is stuck with apid health check failures

Control plane nodes have enough secret material to generate apid server certificates, but worker nodes depend on the control plane trustd services to generate certificates. Worker nodes wait for the kubelet to join the cluster, then apid queries Kubernetes endpoints via the control plane endpoint to find trustd endpoints, and uses trustd to issue the certificate.

So if apid health checks are failing on a worker node, check the following (an example is shown after the list):

  • make sure control plane endpoint is healthy
  • check that worker node kubelet joined the cluster
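
A quick way to verify both conditions (a sketch; assumes a working kubeconfig for the cluster):

# check that the control plane endpoint responds
kubectl get --raw /readyz
# check that the worker node's kubelet has registered with the cluster
kubectl get nodes -o wide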

7.27 - Upgrading Kubernetes

This guide covers the Kubernetes control plane upgrade for clusters running a Talos-managed control plane. If the cluster is still running a self-hosted control plane (after an upgrade from Talos 0.8), please refer to the 0.8 docs.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Automated Kubernetes Upgrade

To check what is going to be upgraded, you can run talosctl upgrade-k8s with the --dry-run flag:

$ talosctl --nodes <master node> upgrade-k8s --to 1.23.0 --dry-run
WARNING: found resources which are going to be deprecated/migrated in the version 1.22.0
RESOURCE                                                               COUNT
validatingwebhookconfigurations.v1beta1.admissionregistration.k8s.io   4
mutatingwebhookconfigurations.v1beta1.admissionregistration.k8s.io     3
customresourcedefinitions.v1beta1.apiextensions.k8s.io                 25
apiservices.v1beta1.apiregistration.k8s.io                             54
leases.v1beta1.coordination.k8s.io                                     4
automatically detected the lowest Kubernetes version 1.22.4
checking for resource APIs to be deprecated in version 1.23.0
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
discovered worker nodes ["172.20.0.5" "172.20.0.6"]
updating "kube-apiserver" to version "1.23.0"
 > "172.20.0.2": starting update
 > update kube-apiserver: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.3": starting update
 > update kube-apiserver: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.4": starting update
 > update kube-apiserver: v1.22.4 -> 1.23.0
 > skipped in dry-run
updating "kube-controller-manager" to version "1.23.0"
 > "172.20.0.2": starting update
 > update kube-controller-manager: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.3": starting update
 > update kube-controller-manager: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.4": starting update
 > update kube-controller-manager: v1.22.4 -> 1.23.0
 > skipped in dry-run
updating "kube-scheduler" to version "1.23.0"
 > "172.20.0.2": starting update
 > update kube-scheduler: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.3": starting update
 > update kube-scheduler: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.4": starting update
 > update kube-scheduler: v1.22.4 -> 1.23.0
 > skipped in dry-run
updating daemonset "kube-proxy" to version "1.23.0"
skipped in dry-run
updating kubelet to version "1.23.0"
 > "172.20.0.2": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.3": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.4": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.5": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > skipped in dry-run
 > "172.20.0.6": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > skipped in dry-run
updating manifests
 > apply manifest Secret bootstrap-token-3lb63t
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding system-bootstrap-approve-node-client-csr
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding system-bootstrap-node-bootstrapper
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding system-bootstrap-node-renewal
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding system:default-sa
 > apply skipped in dry run
 > apply manifest ClusterRole psp:privileged
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding psp:privileged
 > apply skipped in dry run
 > apply manifest PodSecurityPolicy privileged
 > apply skipped in dry run
 > apply manifest ClusterRole flannel
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding flannel
 > apply skipped in dry run
 > apply manifest ServiceAccount flannel
 > apply skipped in dry run
 > apply manifest ConfigMap kube-flannel-cfg
 > apply skipped in dry run
 > apply manifest DaemonSet kube-flannel
 > apply skipped in dry run
 > apply manifest ServiceAccount kube-proxy
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding kube-proxy
 > apply skipped in dry run
 > apply manifest ServiceAccount coredns
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding system:coredns
 > apply skipped in dry run
 > apply manifest ClusterRole system:coredns
 > apply skipped in dry run
 > apply manifest ConfigMap coredns
 > apply skipped in dry run
 > apply manifest Deployment coredns
 > apply skipped in dry run
 > apply manifest Service kube-dns
 > apply skipped in dry run
 > apply manifest ConfigMap kubeconfig-in-cluster
 > apply skipped in dry run

To upgrade Kubernetes from v1.22.4 to v1.23.0 run:

$ talosctl --nodes <master node> upgrade-k8s --to 1.23.0
automatically detected the lowest Kubernetes version 1.22.4
checking for resource APIs to be deprecated in version 1.23.0
discovered master nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
discovered worker nodes ["172.20.0.5" "172.20.0.6"]
updating "kube-apiserver" to version "1.23.0"
 > "172.20.0.2": starting update
 > update kube-apiserver: v1.22.4 -> 1.23.0
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > update kube-apiserver: v1.22.4 -> 1.23.0
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > update kube-apiserver: v1.22.4 -> 1.23.0
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-controller-manager" to version "1.23.0"
 > "172.20.0.2": starting update
 > update kube-controller-manager: v1.22.4 -> 1.23.0
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > update kube-controller-manager: v1.22.4 -> 1.23.0
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > update kube-controller-manager: v1.22.4 -> 1.23.0
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating "kube-scheduler" to version "1.23.0"
 > "172.20.0.2": starting update
 > update kube-scheduler: v1.22.4 -> 1.23.0
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > update kube-scheduler: v1.22.4 -> 1.23.0
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for API server state pod update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > update kube-scheduler: v1.22.4 -> 1.23.0
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for API server state pod update
 < "172.20.0.4": successfully updated
updating daemonset "kube-proxy" to version "1.23.0"
updating kubelet to version "1.23.0"
 > "172.20.0.2": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for kubelet restart
 > "172.20.0.2": waiting for node update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > "172.20.0.3": machine configuration patched
 > "172.20.0.3": waiting for kubelet restart
 > "172.20.0.3": waiting for node update
 < "172.20.0.3": successfully updated
 > "172.20.0.4": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > "172.20.0.4": machine configuration patched
 > "172.20.0.4": waiting for kubelet restart
 > "172.20.0.4": waiting for node update
 < "172.20.0.4": successfully updated
 > "172.20.0.5": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > "172.20.0.5": machine configuration patched
 > "172.20.0.5": waiting for kubelet restart
 > "172.20.0.5": waiting for node update
 < "172.20.0.5": successfully updated
 > "172.20.0.6": starting update
 > update kubelet: v1.22.4 -> 1.23.0
 > "172.20.0.6": machine configuration patched
 > "172.20.0.6": waiting for kubelet restart
 > "172.20.0.6": waiting for node update
 < "172.20.0.6": successfully updated
updating manifests
 > apply manifest Secret bootstrap-token-3lb63t
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding system-bootstrap-approve-node-client-csr
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding system-bootstrap-node-bootstrapper
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding system-bootstrap-node-renewal
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding system:default-sa
 > apply skipped: nothing to update
 > apply manifest ClusterRole psp:privileged
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding psp:privileged
 > apply skipped: nothing to update
 > apply manifest PodSecurityPolicy privileged
 > apply skipped: nothing to update
 > apply manifest ClusterRole flannel
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding flannel
 > apply skipped: nothing to update
 > apply manifest ServiceAccount flannel
 > apply skipped: nothing to update
 > apply manifest ConfigMap kube-flannel-cfg
 > apply skipped: nothing to update
 > apply manifest DaemonSet kube-flannel
 > apply skipped: nothing to update
 > apply manifest ServiceAccount kube-proxy
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding kube-proxy
 > apply skipped: nothing to update
 > apply manifest ServiceAccount coredns
 > apply skipped: nothing to update
 > apply manifest ClusterRoleBinding system:coredns
 > apply skipped: nothing to update
 > apply manifest ClusterRole system:coredns
 > apply skipped: nothing to update
 > apply manifest ConfigMap coredns
 > apply skipped: nothing to update
 > apply manifest Deployment coredns
 > apply skipped: nothing to update
 > apply manifest Service kube-dns
 > apply skipped: nothing to update
 > apply manifest ConfigMap kubeconfig-in-cluster
 > apply skipped: nothing to update

The script runs in several phases:

  1. Every control plane node's machine configuration is patched with the new image version for each control plane component. Talos renders a new static pod definition on the configuration update, which is picked up by the kubelet. The script waits for the change to propagate to the API server state.
  2. The script updates the kube-proxy daemonset with the new image version.
  3. On every node in the cluster, the kubelet version is updated. The script waits for the kubelet service to be restarted and become healthy. The update is verified with the Node resource state.
  4. Kubernetes bootstrap manifests are re-applied to the cluster. The script never deletes any resources from the cluster; they should be deleted manually. Updated bootstrap manifests might come with a new Talos version (e.g. a CoreDNS version update), or might be the result of a machine configuration change.

If the script fails for any reason, it can be safely restarted to continue the upgrade process from the point of failure.

Manual Kubernetes Upgrade

Kubernetes can be upgraded manually as well by following the steps outlined below. They are equivalent to the steps performed by the talosctl upgrade-k8s command.

Kubeconfig

In order to edit the control plane, we will need a working kubectl config. If you don’t already have one, you can get one by running:

talosctl --nodes <master node> kubeconfig

API Server

Patch machine configuration using talosctl patch command:

$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --immediate -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "k8s.gcr.io/kube-apiserver:v1.20.4"}]'
patched mc at the node 172.20.0.2

The JSON patch might need to be adjusted if the current machine configuration is missing the .cluster.apiServer.image key.

The machine configuration can also be edited manually with talosctl -n <IP> edit mc --immediate.

Capture the new version of the kube-apiserver config with:

$ talosctl -n <CONTROL_PLANE_IP_1> get kcpc kube-apiserver -o yaml
node: 172.20.0.2
metadata:
    namespace: config
    type: KubernetesControlPlaneConfigs.config.talos.dev
    id: kube-apiserver
    version: 5
    phase: running
spec:
    image: k8s.gcr.io/kube-apiserver:v1.20.4
    cloudProvider: ""
    controlPlaneEndpoint: https://172.20.0.1:6443
    etcdServers:
        - https://127.0.0.1:2379
    localPort: 6443
    serviceCIDR: 10.96.0.0/12
    extraArgs: {}
    extraVolumes: []

In this example, the new version is 5. Wait for the new pod definition to propagate to the API server state (replace talos-default-master-1 with the node name):

$ kubectl get pod -n kube-system -l k8s-app=kube-apiserver --field-selector spec.nodeName=talos-default-master-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'
5

Check that the pod is running:

$ kubectl get pod -n kube-system -l k8s-app=kube-apiserver --field-selector spec.nodeName=talos-default-master-1
NAME                                    READY   STATUS    RESTARTS   AGE
kube-apiserver-talos-default-master-1   1/1     Running   0          16m

Repeat this process for every control plane node, verifying that the state propagated successfully after each node update.

Controller Manager

Patch machine configuration using talosctl patch command:

$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --immediate -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "k8s.gcr.io/kube-controller-manager:v1.20.4"}]'
patched mc at the node 172.20.0.2

The JSON patch might need to be adjusted if the current machine configuration is missing the .cluster.controllerManager.image key.

Capture the new version of the kube-controller-manager config with:

$ talosctl -n <CONTROL_PLANE_IP_1> get kcpc kube-controller-manager -o yaml
node: 172.20.0.2
metadata:
    namespace: config
    type: KubernetesControlPlaneConfigs.config.talos.dev
    id: kube-controller-manager
    version: 3
    phase: running
spec:
    image: k8s.gcr.io/kube-controller-manager:v1.20.4
    cloudProvider: ""
    podCIDR: 10.244.0.0/16
    serviceCIDR: 10.96.0.0/12
    extraArgs: {}
    extraVolumes: []

In this example, the new version is 3. Wait for the new pod definition to propagate to the API server state (replace talos-default-master-1 with the node name):

$ kubectl get pod -n kube-system -l k8s-app=kube-controller-manager --field-selector spec.nodeName=talos-default-master-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'
3

Check that the pod is running:

$ kubectl get pod -n kube-system -l k8s-app=kube-controller-manager --field-selector spec.nodeName=talos-default-master-1
NAME                                             READY   STATUS    RESTARTS   AGE
kube-controller-manager-talos-default-master-1   1/1     Running   0          35m

Repeat this process for every control plane node, verifying that the state propagated successfully after each node update.

Scheduler

Patch machine configuration using talosctl patch command:

$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --immediate -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "k8s.gcr.io/kube-scheduler:v1.20.4"}]'
patched mc at the node 172.20.0.2

The JSON patch might need to be adjusted if the current machine configuration is missing the .cluster.scheduler.image key.

Capture the new version of the kube-scheduler config with:

$ talosctl -n <CONTROL_PLANE_IP_1> get kcpc kube-scheduler -o yaml
node: 172.20.0.2
metadata:
    namespace: config
    type: KubernetesControlPlaneConfigs.config.talos.dev
    id: kube-scheduler
    version: 3
    phase: running
spec:
    image: k8s.gcr.io/kube-scheduler:v1.20.4
    extraArgs: {}
    extraVolumes: []

In this example, the new version is 3. Wait for the new pod definition to propagate to the API server state (replace talos-default-master-1 with the node name):

$ kubectl get pod -n kube-system -l k8s-app=kube-scheduler --field-selector spec.nodeName=talos-default-master-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'
3

Check that the pod is running:

$ kubectl get pod -n kube-system -l k8s-app=kube-scheduler --field-selector spec.nodeName=talos-default-master-1
NAME                                    READY   STATUS    RESTARTS   AGE
kube-scheduler-talos-default-master-1   1/1     Running   0          39m

Repeat this process for every control plane node, verifying that the state propagated successfully after each node update.

Proxy

In the proxy’s DaemonSet, change:

kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-proxy
          image: k8s.gcr.io/kube-proxy:v1.20.1
      tolerations:
        - ...

to:

kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-proxy
          image: k8s.gcr.io/kube-proxy:v1.20.4
      tolerations:
        - ...
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule

To edit the DaemonSet, run:

kubectl edit daemonsets -n kube-system kube-proxy
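
Alternatively, the image can be updated non-interactively (a sketch; assumes the container is named kube-proxy, as in the snippet above):

# bump the kube-proxy image without opening an editor
kubectl -n kube-system set image daemonset/kube-proxy kube-proxy=k8s.gcr.io/kube-proxy:v1.20.4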

Bootstrap Manifests

Bootstrap manifests can be retrieved in a format which works for kubectl with the following command:

talosctl -n <master IP> get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml

Diff the manifests with the cluster:

kubectl diff -f manifests.yaml

Apply the manifests:

kubectl apply -f manifests.yaml

Note: if some bootstrap resources were removed, they have to be removed from the cluster manually.

kubelet

For every node, patch the machine configuration with the new kubelet version and wait for the kubelet to restart with the new version:

$ talosctl -n <IP> patch mc --immediate -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/talos-systems/kubelet:v1.23.0"}]'
patched mc at the node 172.20.0.2

Once the kubelet restarts with the new configuration, confirm the upgrade with kubectl get nodes <name>:

$ kubectl get nodes talos-default-master-1
NAME                     STATUS   ROLES                  AGE    VERSION
talos-default-master-1   Ready    control-plane,master   123m   v1.23.0

7.28 - Upgrading Talos

Talos upgrades are effected by an API call. The talosctl CLI utility will facilitate this.

Video Walkthrough

To see a live demo of this writeup, see the video below:

After Upgrade to 0.14

No actions required.

talosctl Upgrade

To manually upgrade a Talos node, you will specify the node’s IP address and the installer container image for the version of Talos to which you wish to upgrade.

For instance, if your Talos node has the IP address 10.20.30.40 and you want to install the official version v0.14.0, you would enter a command such as:

  $ talosctl upgrade --nodes 10.20.30.40 \
      --image ghcr.io/talos-systems/installer:v0.14.0

There is an option to this command: --preserve, which can be used to explicitly tell Talos whether or not to keep its ephemeral data intact. In most cases, it is correct to let Talos perform its default action. However, if you are running a single-node control plane, you will want to make sure that --preserve=true.

If Talos fails to run the upgrade, the --stage flag may be used to stage the upgrade to be applied after a reboot, which is then followed by another reboot into the upgraded version.
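
For example, using the node and image from above, these options can be combined as follows (a sketch):

# keep ephemeral data and stage the upgrade across reboots
talosctl upgrade --nodes 10.20.30.40 \
    --image ghcr.io/talos-systems/installer:v0.14.0 \
    --preserve=true --stage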

Machine Configuration Changes

Talos 0.14 enables cluster discovery by default for new clusters. The cluster discovery feature won’t be enabled after an upgrade if it wasn’t enabled before the upgrade.
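
If desired, cluster discovery can be enabled on an upgraded cluster by patching the machine configuration (a sketch; assumes the feature is configured under the cluster.discovery key of the machine config):

# enable cluster discovery on a node after the upgrade (assumed config path)
talosctl -n <IP> patch mc --immediate -p '[{"op": "add", "path": "/cluster/discovery", "value": {"enabled": true}}]'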

7.29 - Virtual (shared) IP

One of the biggest pain points when building a high-availability controlplane is giving clients a single IP or URL at which they can reach any of the controlplane nodes. The most common approaches all require external resources: reverse proxy, load balancer, BGP, and DNS.

Using a “Virtual” IP address, on the other hand, provides high availability without external coordination or resources, so long as the controlplane members share a layer 2 network. In practical terms, this means that they are all connected via a switch, with no router in between them.

The term “virtual” is misleading here. The IP address is real, and it is assigned to an interface. Instead, what actually happens is that the controlplane machines vie for control of the shared IP address. There can be only one owner of the IP address at any given time, but if that owner disappears or becomes non-responsive, another owner will be chosen, and it will take over ownership of the IP address.

Talos has (as of version 0.9) built-in support for this form of shared IP address, and it can utilize this for both the Kubernetes API server and the Talos endpoint set. Talos uses etcd for elections and leadership (control) of the IP address.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Choose your Shared IP

To begin with, you should choose your shared IP address. It should generally be a reserved, unused IP address in the same subnet as your controlplane nodes. It should not be assigned or assignable by your DHCP server.

For our example, we will assume that the controlplane nodes have the following IP addresses:

  • 192.168.0.10
  • 192.168.0.11
  • 192.168.0.12

We then choose our shared IP to be:

192.168.0.15

Configure your Talos Machines

The shared IP setting is only valid for controlplane nodes.

For the example above, each of the controlplane nodes should have the following Machine Config snippet:

machine:
  network:
    interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: 192.168.0.15

Virtual IPs can also be configured on a VLAN interface.

machine:
  network:
    interfaces:
    - interface: eth0
      dhcp: true
      vip:
        ip: 192.168.0.15
      vlans:
        - vlanId: 100
          dhcp: true
          vip:
            ip: 192.168.1.15

Obviously, for your own environment, the interface and the DHCP setting may differ. You are free to use static addressing (cidr) instead of DHCP.

Caveats

In general, the shared IP should just work. However, since it relies on etcd for elections, the shared IP will not come alive until after you have bootstrapped Kubernetes. In general, this is not a problem, but it does mean that you cannot use the shared IP when issuing the talosctl bootstrap command. Instead, that command will need to target one of the controlplane nodes discretely.
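
For the example above, the bootstrap command would therefore target one of the controlplane node addresses directly rather than the shared IP (a sketch):

# bootstrap via a specific controlplane node, not the virtual IP
talosctl --nodes 192.168.0.10 --endpoints 192.168.0.10 bootstrap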

8 - Reference

8.1 - API

Talos gRPC API reference.

Table of Contents


common/common.proto

Data

Field | Type | Label | Description
metadata | Metadata
bytes | bytes

DataResponse

Field | Type | Label | Description
messages | Data | repeated

Empty

Field | Type | Label | Description
metadata | Metadata

EmptyResponse

Field | Type | Label | Description
messages | Empty | repeated

Error

Field | Type | Label | Description
code | Code
message | string
details | google.protobuf.Any | repeated

Metadata

Common metadata message nested in all reply message types

Field | Type | Label | Description
hostname | string | | hostname of the server response comes from (injected by proxy)
error | string | | error is set if request failed to the upstream (rest of response is undefined)
status | google.rpc.Status | | error as gRPC Status

Code

Name | Number | Description
FATAL | 0
LOCKED | 1

ContainerDriver

Name | Number | Description
CONTAINERD | 0
CRI | 1

File-level Extensions

Extension | Type | Base | Number | Description
remove_deprecated_enum | string | .google.protobuf.EnumOptions | 93117 | Indicates the Talos version when this deprecated enum will be removed from API.
remove_deprecated_enum_value | string | .google.protobuf.EnumValueOptions | 93117 | Indicates the Talos version when this deprecated enum value will be removed from API.
remove_deprecated_field | string | .google.protobuf.FieldOptions | 93117 | Indicates the Talos version when this deprecated field will be removed from API.
remove_deprecated_message | string | .google.protobuf.MessageOptions | 93117 | Indicates the Talos version when this deprecated message will be removed from API.
remove_deprecated_method | string | .google.protobuf.MethodOptions | 93117 | Indicates the Talos version when this deprecated method will be removed from API.
remove_deprecated_service | string | .google.protobuf.ServiceOptions | 93117 | Indicates the Talos version when this deprecated service will be removed from API.


inspect/inspect.proto

ControllerDependencyEdge

Field | Type | Label | Description
controller_name | string
edge_type | DependencyEdgeType
resource_namespace | string
resource_type | string
resource_id | string

ControllerRuntimeDependenciesResponse

Field | Type | Label | Description
messages | ControllerRuntimeDependency | repeated

ControllerRuntimeDependency

The ControllerRuntimeDependency message contains the graph of controller-resource dependencies.

Field | Type | Label | Description
metadata | common.Metadata
edges | ControllerDependencyEdge | repeated

DependencyEdgeType

Name | Number | Description
OUTPUT_EXCLUSIVE | 0
OUTPUT_SHARED | 3
INPUT_STRONG | 1
INPUT_WEAK | 2
INPUT_DESTROY_READY | 4

InspectService

The inspect service definition.

InspectService provides an auxiliary API to inspect OS internals.

Method Name | Request Type | Response Type | Description
ControllerRuntimeDependencies | .google.protobuf.Empty | ControllerRuntimeDependenciesResponse


machine/machine.proto

AddressEvent

AddressEvent reports node endpoints aggregated from k8s.Endpoints and network.Hostname.

FieldTypeLabelDescription
hostnamestring
addressesstringrepeated

ApplyConfiguration

ApplyConfigurationResponse describes the response to a configuration request.

FieldTypeLabelDescription
metadatacommon.Metadata
warningsstringrepeatedConfiguration validation warnings.

ApplyConfigurationRequest

rpc applyConfiguration ApplyConfiguration describes a request to assert a new configuration upon a node.

FieldTypeLabelDescription
databytes
on_rebootbool
immediatebool

ApplyConfigurationResponse

FieldTypeLabelDescription
messagesApplyConfigurationrepeated

Bootstrap

The bootstrap message containing the bootstrap status.

FieldTypeLabelDescription
metadatacommon.Metadata

BootstrapRequest

rpc Bootstrap

Field | Type | Label | Description
recover_etcd | bool | | Enable etcd recovery from the snapshot. Snapshot should be uploaded before this call via EtcdRecover RPC.
recover_skip_hash_check | bool | | Skip hash check on the snapshot (etcd). Enable this when recovering from data directory copy to skip integrity check.

BootstrapResponse

FieldTypeLabelDescription
messagesBootstraprepeated

CNIConfig

FieldTypeLabelDescription
namestring
urlsstringrepeated

CPUInfo

FieldTypeLabelDescription
processoruint32
vendor_idstring
cpu_familystring
modelstring
model_namestring
steppingstring
microcodestring
cpu_mhzdouble
cache_sizestring
physical_idstring
siblingsuint32
core_idstring
cpu_coresuint32
apic_idstring
initial_apic_idstring
fpustring
fpu_exceptionstring
cpu_id_leveluint32
wpstring
flagsstringrepeated
bugsstringrepeated
bogo_mipsdouble
cl_flush_sizeuint32
cache_alignmentuint32
address_sizesstring
power_managementstring

CPUInfoResponse

FieldTypeLabelDescription
messagesCPUsInforepeated

CPUStat

FieldTypeLabelDescription
userdouble
nicedouble
systemdouble
idledouble
iowaitdouble
irqdouble
soft_irqdouble
stealdouble
guestdouble
guest_nicedouble

CPUsInfo

FieldTypeLabelDescription
metadatacommon.Metadata
cpu_infoCPUInforepeated

ClusterConfig

FieldTypeLabelDescription
namestring
control_planeControlPlaneConfig
cluster_networkClusterNetworkConfig
allow_scheduling_on_mastersbool

ClusterNetworkConfig

FieldTypeLabelDescription
dns_domainstring
cni_configCNIConfig

ConfigLoadErrorEvent

ConfigLoadErrorEvent is reported when the config loading has failed.

FieldTypeLabelDescription
errorstring

ConfigValidationErrorEvent

ConfigValidationErrorEvent is reported when config validation has failed.

FieldTypeLabelDescription
errorstring

Container

The messages message containing the requested containers.

FieldTypeLabelDescription
metadatacommon.Metadata
containersContainerInforepeated

ContainerInfo

The messages message containing the requested containers.

FieldTypeLabelDescription
namespacestring
idstring
imagestring
piduint32
statusstring
pod_idstring
namestring

ContainersRequest

FieldTypeLabelDescription
namespacestring
drivercommon.ContainerDriverdriver might be default “containerd” or “cri”

ContainersResponse

FieldTypeLabelDescription
messagesContainerrepeated

ControlPlaneConfig

FieldTypeLabelDescription
endpointstring

CopyRequest

CopyRequest describes a request to copy data out of Talos node

Copy produces .tar.gz archive which is streamed back to the caller

FieldTypeLabelDescription
root_pathstringRoot path to start copying data out, it might be either a file or directory

DHCPOptionsConfig

FieldTypeLabelDescription
route_metricuint32

DiskStat

FieldTypeLabelDescription
namestring
read_completeduint64
read_mergeduint64
read_sectorsuint64
read_time_msuint64
write_completeduint64
write_mergeduint64
write_sectorsuint64
write_time_msuint64
io_in_progressuint64
io_time_msuint64
io_time_weighted_msuint64
discard_completeduint64
discard_mergeduint64
discard_sectorsuint64
discard_time_msuint64

DiskStats

FieldTypeLabelDescription
metadatacommon.Metadata
totalDiskStat
devicesDiskStatrepeated

DiskStatsResponse

FieldTypeLabelDescription
messagesDiskStatsrepeated

DiskUsageInfo

DiskUsageInfo describes a file or directory’s information for du command

FieldTypeLabelDescription
metadatacommon.Metadata
namestringName is the name (including prefixed path) of the file or directory
sizeint64Size indicates the number of bytes contained within the file
errorstringError describes any error encountered while trying to read the file information.
relative_namestringRelativeName is the name of the file or directory relative to the RootPath

DiskUsageRequest

DiskUsageRequest describes a request to list disk usage of directories and regular files

FieldTypeLabelDescription
recursion_depthint32RecursionDepth indicates how many levels of subdirectories should be recursed. The default (0) indicates that no limit should be enforced.
allboolAll write sizes for all files, not just directories.
thresholdint64Threshold exclude entries smaller than SIZE if positive, or entries greater than SIZE if negative.
pathsstringrepeatedDiskUsagePaths is the list of directories to calculate disk usage for.

DmesgRequest

dmesg

FieldTypeLabelDescription
followbool
tailbool

EtcdForfeitLeadership

FieldTypeLabelDescription
metadatacommon.Metadata
memberstring

EtcdForfeitLeadershipRequest

EtcdForfeitLeadershipResponse

FieldTypeLabelDescription
messagesEtcdForfeitLeadershiprepeated

EtcdLeaveCluster

FieldTypeLabelDescription
metadatacommon.Metadata

EtcdLeaveClusterRequest

EtcdLeaveClusterResponse

FieldTypeLabelDescription
messagesEtcdLeaveClusterrepeated

EtcdMember

EtcdMember describes a single etcd member.

FieldTypeLabelDescription
iduint64member ID.
hostnamestringhuman-readable name of the member.
peer_urlsstringrepeatedthe list of URLs the member exposes to clients for communication.
client_urlsstringrepeatedthe list of URLs the member exposes to the cluster for communication.
is_learnerboollearner flag

EtcdMemberListRequest

FieldTypeLabelDescription
query_localbool

EtcdMemberListResponse

FieldTypeLabelDescription
messagesEtcdMembersrepeated

EtcdMembers

EtcdMembers contains the list of members registered on the host.

FieldTypeLabelDescription
metadatacommon.Metadata
legacy_membersstringrepeatedlist of member hostnames.
membersEtcdMemberrepeatedthe list of etcd members registered on the node.

EtcdRecover

FieldTypeLabelDescription
metadatacommon.Metadata

EtcdRecoverResponse

FieldTypeLabelDescription
messagesEtcdRecoverrepeated

EtcdRemoveMember

FieldTypeLabelDescription
metadatacommon.Metadata

EtcdRemoveMemberRequest

FieldTypeLabelDescription
memberstring

EtcdRemoveMemberResponse

FieldTypeLabelDescription
messagesEtcdRemoveMemberrepeated

EtcdSnapshotRequest

Event

FieldTypeLabelDescription
metadatacommon.Metadata
datagoogle.protobuf.Any
idstring

EventsRequest

FieldTypeLabelDescription
tail_eventsint32
tail_idstring
tail_secondsint32

FeaturesInfo

FeaturesInfo describes individual Talos features that can be switched on or off.

FieldTypeLabelDescription
rbacboolRBAC is true if role-based access control is enabled.

FileInfo

FileInfo describes a file or directory’s information

FieldTypeLabelDescription
metadatacommon.Metadata
namestringName is the name (including prefixed path) of the file or directory
sizeint64Size indicates the number of bytes contained within the file
modeuint32Mode is the bitmap of UNIX mode/permission flags of the file
modifiedint64Modified indicates the UNIX timestamp at which the file was last modified
is_dirboolIsDir indicates that the file is a directory
errorstringError describes any error encountered while trying to read the file information.
linkstringLink is filled with symlink target
relative_namestringRelativeName is the name of the file or directory relative to the RootPath
uiduint32Owner uid
giduint32Owner gid

GenerateClientConfiguration

FieldTypeLabelDescription
metadatacommon.Metadata
cabytesPEM-encoded CA certificate.
crtbytesPEM-encoded generated client certificate.
keybytesPEM-encoded generated client key.
talosconfigbytesClient configuration (talosconfig) file content.

GenerateClientConfigurationRequest

FieldTypeLabelDescription
rolesstringrepeatedRoles in the generated client certificate.
crt_ttlgoogle.protobuf.DurationClient certificate TTL.

GenerateClientConfigurationResponse

FieldTypeLabelDescription
messagesGenerateClientConfigurationrepeated

GenerateConfiguration

GenerateConfiguration describes the response to a generate configuration request.

FieldTypeLabelDescription
metadatacommon.Metadata
databytesrepeated
talosconfigbytes

GenerateConfigurationRequest

GenerateConfigurationRequest describes a request to generate a new configuration on a node.

FieldTypeLabelDescription
config_versionstring
cluster_configClusterConfig
machine_configMachineConfig
override_timegoogle.protobuf.Timestamp

GenerateConfigurationResponse

FieldTypeLabelDescription
messagesGenerateConfigurationrepeated

Hostname

FieldTypeLabelDescription
metadatacommon.Metadata
hostnamestring

HostnameResponse

FieldTypeLabelDescription
messagesHostnamerepeated

InstallConfig

FieldTypeLabelDescription
install_diskstring
install_imagestring

ListRequest

ListRequest describes a request to list the contents of a directory.

FieldTypeLabelDescription
rootstringRoot indicates the root directory for the list. If not indicated, ‘/’ is presumed.
recurseboolRecurse indicates that subdirectories should be recursed.
recursion_depthint32RecursionDepth indicates how many levels of subdirectories should be recursed. The default (0) indicates that no limit should be enforced.
typesListRequest.TyperepeatedTypes indicates what file type should be returned. If not indicated, all files will be returned.

LoadAvg

FieldTypeLabelDescription
metadatacommon.Metadata
load1double
load5double
load15double

LoadAvgResponse

FieldTypeLabelDescription
messagesLoadAvgrepeated

LogsRequest

rpc logs The request message containing the process name.

FieldTypeLabelDescription
namespacestring
idstring
drivercommon.ContainerDriverdriver might be default “containerd” or “cri”
followbool
tail_linesint32

MachineConfig

FieldTypeLabelDescription
typeMachineConfig.MachineType
install_configInstallConfig
network_configNetworkConfig
kubernetes_versionstring

MemInfo

FieldTypeLabelDescription
memtotaluint64
memfreeuint64
memavailableuint64
buffersuint64
cacheduint64
swapcacheduint64
activeuint64
inactiveuint64
activeanonuint64
inactiveanonuint64
activefileuint64
inactivefileuint64
unevictableuint64
mlockeduint64
swaptotaluint64
swapfreeuint64
dirtyuint64
writebackuint64
anonpagesuint64
mappeduint64
shmemuint64
slabuint64
sreclaimableuint64
sunreclaimuint64
kernelstackuint64
pagetablesuint64
nfsunstableuint64
bounceuint64
writebacktmpuint64
commitlimituint64
committedasuint64
vmalloctotaluint64
vmallocuseduint64
vmallocchunkuint64
hardwarecorrupteduint64
anonhugepagesuint64
shmemhugepagesuint64
shmempmdmappeduint64
cmatotaluint64
cmafreeuint64
hugepagestotaluint64
hugepagesfreeuint64
hugepagesrsvduint64
hugepagessurpuint64
hugepagesizeuint64
directmap4kuint64
directmap2muint64
directmap1guint64

Memory

FieldTypeLabelDescription
metadatacommon.Metadata
meminfoMemInfo

MemoryResponse

FieldTypeLabelDescription
messagesMemoryrepeated

MountStat

The messages message containing the requested processes.

FieldTypeLabelDescription
filesystemstring
sizeuint64
availableuint64
mounted_onstring

Mounts

The messages message containing the requested df stats.

FieldTypeLabelDescription
metadatacommon.Metadata
statsMountStatrepeated

MountsResponse

FieldTypeLabelDescription
messagesMountsrepeated

NetDev

FieldTypeLabelDescription
namestring
rx_bytesuint64
rx_packetsuint64
rx_errorsuint64
rx_droppeduint64
rx_fifouint64
rx_frameuint64
rx_compresseduint64
rx_multicastuint64
tx_bytesuint64
tx_packetsuint64
tx_errorsuint64
tx_droppeduint64
tx_fifouint64
tx_collisionsuint64
tx_carrieruint64
tx_compresseduint64

NetworkConfig

FieldTypeLabelDescription
hostnamestring
interfacesNetworkDeviceConfigrepeated

NetworkDeviceConfig

FieldTypeLabelDescription
interfacestring
cidrstring
mtuint32
dhcpbool
ignorebool
dhcp_optionsDHCPOptionsConfig
routesRouteConfigrepeated

NetworkDeviceStats

FieldTypeLabelDescription
metadatacommon.Metadata
totalNetDev
devicesNetDevrepeated

NetworkDeviceStatsResponse

FieldTypeLabelDescription
messagesNetworkDeviceStatsrepeated

PhaseEvent

FieldTypeLabelDescription
phasestring
actionPhaseEvent.Action

PlatformInfo

FieldTypeLabelDescription
namestring
modestring

Process

FieldTypeLabelDescription
metadatacommon.Metadata
processesProcessInforepeated

ProcessInfo

FieldTypeLabelDescription
pidint32
ppidint32
statestring
threadsint32
cpu_timedouble
virtual_memoryuint64
resident_memoryuint64
commandstring
executablestring
argsstring

ProcessesResponse

rpc processes

FieldTypeLabelDescription
messagesProcessrepeated

ReadRequest

FieldTypeLabelDescription
pathstring

Reboot

The reboot message containing the reboot status.

FieldTypeLabelDescription
metadatacommon.Metadata

RebootRequest

rpc reboot

FieldTypeLabelDescription
modeRebootRequest.Mode

RebootResponse

FieldTypeLabelDescription
messagesRebootrepeated

Reset

The reset message containing the restart status.

FieldTypeLabelDescription
metadatacommon.Metadata

ResetPartitionSpec

rpc reset

FieldTypeLabelDescription
labelstring
wipebool

ResetRequest

FieldTypeLabelDescription
gracefulboolGraceful indicates whether node should leave etcd before the upgrade, it also enforces etcd checks before leaving.
rebootboolReboot indicates whether node should reboot or halt after resetting.
system_partitions_to_wipeResetPartitionSpecrepeatedSystem_partitions_to_wipe lists specific system disk partitions to be reset (wiped). If system_partitions_to_wipe is empty, all the partitions are erased.

ResetResponse

FieldTypeLabelDescription
messagesResetrepeated

Restart

FieldTypeLabelDescription
metadatacommon.Metadata

RestartEvent

FieldTypeLabelDescription
cmdint64

RestartRequest

rpc restart The request message containing the process to restart.

FieldTypeLabelDescription
namespacestring
idstring
drivercommon.ContainerDriverdriver might be default “containerd” or “cri”

RestartResponse

The messages message containing the restart status.

FieldTypeLabelDescription
messagesRestartrepeated

Rollback

FieldTypeLabelDescription
metadatacommon.Metadata

RollbackRequest

rpc rollback

RollbackResponse

FieldTypeLabelDescription
messagesRollbackrepeated

RouteConfig

FieldTypeLabelDescription
networkstring
gatewaystring
metricuint32

SequenceEvent

rpc events

FieldTypeLabelDescription
sequencestring
actionSequenceEvent.Action
errorcommon.Error

ServiceEvent

FieldTypeLabelDescription
msgstring
statestring
tsgoogle.protobuf.Timestamp

ServiceEvents

FieldTypeLabelDescription
eventsServiceEventrepeated

ServiceHealth

FieldTypeLabelDescription
unknownbool
healthybool
last_messagestring
last_changegoogle.protobuf.Timestamp

ServiceInfo

FieldTypeLabelDescription
idstring
statestring
eventsServiceEvents
healthServiceHealth

ServiceList

rpc servicelist

FieldTypeLabelDescription
metadatacommon.Metadata
servicesServiceInforepeated

ServiceListResponse

FieldTypeLabelDescription
messagesServiceListrepeated

ServiceRestart

FieldTypeLabelDescription
metadatacommon.Metadata
respstring

ServiceRestartRequest

FieldTypeLabelDescription
idstring

ServiceRestartResponse

FieldTypeLabelDescription
messagesServiceRestartrepeated

ServiceStart

FieldTypeLabelDescription
metadatacommon.Metadata
respstring

ServiceStartRequest

rpc servicestart

FieldTypeLabelDescription
idstring

ServiceStartResponse

FieldTypeLabelDescription
messagesServiceStartrepeated

ServiceStateEvent

FieldTypeLabelDescription
servicestring
actionServiceStateEvent.Action
messagestring
healthServiceHealth

ServiceStop

FieldTypeLabelDescription
metadatacommon.Metadata
respstring

ServiceStopRequest

FieldTypeLabelDescription
idstring

ServiceStopResponse

FieldTypeLabelDescription
messagesServiceStoprepeated

Shutdown

rpc shutdown The messages message containing the shutdown status.

FieldTypeLabelDescription
metadatacommon.Metadata

ShutdownResponse

FieldTypeLabelDescription
messagesShutdownrepeated

SoftIRQStat

FieldTypeLabelDescription
hiuint64
timeruint64
net_txuint64
net_rxuint64
blockuint64
block_io_polluint64
taskletuint64
scheduint64
hrtimeruint64
rcuuint64

Stat

The messages message containing the requested stat.

FieldTypeLabelDescription
namespacestring
idstring
memory_usageuint64
cpu_usageuint64
pod_idstring
namestring

Stats

The messages message containing the requested stats.

FieldTypeLabelDescription
metadatacommon.Metadata
statsStatrepeated

StatsRequest

The request message containing the containerd namespace.

FieldTypeLabelDescription
namespacestring
drivercommon.ContainerDriverdriver might be default “containerd” or “cri”

StatsResponse

FieldTypeLabelDescription
messagesStatsrepeated

SystemStat

FieldTypeLabelDescription
metadatacommon.Metadata
boot_timeuint64
cpu_totalCPUStat
cpuCPUStatrepeated
irq_totaluint64
irquint64repeated
context_switchesuint64
process_createduint64
process_runninguint64
process_blockeduint64
soft_irq_totaluint64
soft_irqSoftIRQStat

SystemStatResponse

FieldTypeLabelDescription
messagesSystemStatrepeated

TaskEvent

FieldTypeLabelDescription
taskstring
actionTaskEvent.Action

Upgrade

FieldTypeLabelDescription
metadatacommon.Metadata
ackstring

UpgradeRequest

rpc upgrade

FieldTypeLabelDescription
imagestring
preservebool
stagebool
forcebool

UpgradeResponse

FieldTypeLabelDescription
messagesUpgraderepeated

Version

FieldTypeLabelDescription
metadatacommon.Metadata
versionVersionInfo
platformPlatformInfo
featuresFeaturesInfoFeatures describe individual Talos features that can be switched on or off.

VersionInfo

FieldTypeLabelDescription
tagstring
shastring
builtstring
go_versionstring
osstring
archstring

VersionResponse

FieldTypeLabelDescription
messagesVersionrepeated

ListRequest.Type

File type.

NameNumberDescription
REGULAR0Regular file (not directory, symlink, etc).
DIRECTORY1Directory.
SYMLINK2Symbolic link.

MachineConfig.MachineType

NameNumberDescription
TYPE_UNKNOWN0
TYPE_INIT1
TYPE_CONTROL_PLANE2
TYPE_WORKER3
TYPE_JOIN3Deprecated alias for TYPE_WORKER. It will be removed in v0.15.

PhaseEvent.Action

NameNumberDescription
START0
STOP1

RebootRequest.Mode

NameNumberDescription
DEFAULT0
POWERCYCLE1

SequenceEvent.Action

NameNumberDescription
NOOP0
START1
STOP2

ServiceStateEvent.Action

NameNumberDescription
INITIALIZED0
PREPARING1
WAITING2
RUNNING3
STOPPING4
FINISHED5
FAILED6
SKIPPED7

TaskEvent.Action

NameNumberDescription
START0
STOP1

MachineService

The machine service definition.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| ApplyConfiguration | ApplyConfigurationRequest | ApplyConfigurationResponse | |
| Bootstrap | BootstrapRequest | BootstrapResponse | Bootstrap method makes control plane node enter etcd bootstrap mode. Node aborts etcd join sequence and creates single-node etcd cluster. If recover_etcd argument is specified, etcd is recovered from a snapshot uploaded with EtcdRecover. |
| Containers | ContainersRequest | ContainersResponse | |
| Copy | CopyRequest | .common.Data stream | |
| CPUInfo | .google.protobuf.Empty | CPUInfoResponse | |
| DiskStats | .google.protobuf.Empty | DiskStatsResponse | |
| Dmesg | DmesgRequest | .common.Data stream | |
| Events | EventsRequest | Event stream | |
| EtcdMemberList | EtcdMemberListRequest | EtcdMemberListResponse | |
| EtcdRemoveMember | EtcdRemoveMemberRequest | EtcdRemoveMemberResponse | |
| EtcdLeaveCluster | EtcdLeaveClusterRequest | EtcdLeaveClusterResponse | |
| EtcdForfeitLeadership | EtcdForfeitLeadershipRequest | EtcdForfeitLeadershipResponse | |
| EtcdRecover | .common.Data stream | EtcdRecoverResponse | EtcdRecover method uploads etcd data snapshot created with EtcdSnapshot to the node. Snapshot can be later used to recover the cluster via Bootstrap method. |
| EtcdSnapshot | EtcdSnapshotRequest | .common.Data stream | EtcdSnapshot method creates etcd data snapshot (backup) from the local etcd instance and streams it back to the client. This method is available only on control plane nodes (which run etcd). |
| GenerateConfiguration | GenerateConfigurationRequest | GenerateConfigurationResponse | |
| Hostname | .google.protobuf.Empty | HostnameResponse | |
| Kubeconfig | .google.protobuf.Empty | .common.Data stream | |
| List | ListRequest | FileInfo stream | |
| DiskUsage | DiskUsageRequest | DiskUsageInfo stream | |
| LoadAvg | .google.protobuf.Empty | LoadAvgResponse | |
| Logs | LogsRequest | .common.Data stream | |
| Memory | .google.protobuf.Empty | MemoryResponse | |
| Mounts | .google.protobuf.Empty | MountsResponse | |
| NetworkDeviceStats | .google.protobuf.Empty | NetworkDeviceStatsResponse | |
| Processes | .google.protobuf.Empty | ProcessesResponse | |
| Read | ReadRequest | .common.Data stream | |
| Reboot | RebootRequest | RebootResponse | |
| Restart | RestartRequest | RestartResponse | |
| Rollback | RollbackRequest | RollbackResponse | |
| Reset | ResetRequest | ResetResponse | |
| ServiceList | .google.protobuf.Empty | ServiceListResponse | |
| ServiceRestart | ServiceRestartRequest | ServiceRestartResponse | |
| ServiceStart | ServiceStartRequest | ServiceStartResponse | |
| ServiceStop | ServiceStopRequest | ServiceStopResponse | |
| Shutdown | .google.protobuf.Empty | ShutdownResponse | |
| Stats | StatsRequest | StatsResponse | |
| SystemStat | .google.protobuf.Empty | SystemStatResponse | |
| Upgrade | UpgradeRequest | UpgradeResponse | |
| Version | .google.protobuf.Empty | VersionResponse | |
| GenerateClientConfiguration | GenerateClientConfigurationRequest | GenerateClientConfigurationResponse | GenerateClientConfiguration generates talosctl client configuration (talosconfig). |
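
Most of these RPCs map directly onto talosctl subcommands described later in this document. A few illustrative invocations follow; the node IP is a placeholder, and kubelet is used as an example of a standard Talos service:

talosctl -n 10.5.0.2 containers       # Containers RPC
talosctl -n 10.5.0.2 logs kubelet     # Logs RPC
talosctl -n 10.5.0.2 reboot           # Reboot RPC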


resource/resource.proto

Get

The GetResponse message contains the Resource returned.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| metadata | common.Metadata | | |
| definition | Resource | | |
| resource | Resource | | |

GetRequest

rpc Get

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| namespace | string | | |
| type | string | | |
| id | string | | |

GetResponse

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| messages | Get | repeated | |

ListRequest

rpc List The ListResponse message contains the Resource returned.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| namespace | string | | |
| type | string | | |

ListResponse

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| metadata | common.Metadata | | |
| definition | Resource | | |
| resource | Resource | | |

Metadata

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| namespace | string | | |
| type | string | | |
| id | string | | |
| version | string | | |
| owner | string | | |
| phase | string | | |
| created | google.protobuf.Timestamp | | |
| updated | google.protobuf.Timestamp | | |
| finalizers | string | repeated | |

Resource

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| metadata | Metadata | | |
| spec | Spec | | |

Spec

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| yaml | bytes | | |

WatchRequest

rpc Watch The WatchResponse message contains the Resource returned.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| namespace | string | | |
| type | string | | |
| id | string | | |
| tail_events | uint32 | | |

WatchResponse

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| metadata | common.Metadata | | |
| event_type | EventType | | |
| definition | Resource | | |
| resource | Resource | | |

EventType

| Name | Number | Description |
| ---- | ------ | ----------- |
| CREATED | 0 | |
| UPDATED | 1 | |
| DESTROYED | 2 | |

ResourceService

The resource service definition.

ResourceService provides user-facing API for the Talos resources.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Get | GetRequest | GetResponse | |
| List | ListRequest | ListResponse stream | |
| Watch | WatchRequest | WatchResponse stream | |
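
On the command line these RPCs back the 'talosctl get' command; for example (the node IP is a placeholder):

talosctl -n 10.5.0.2 get rd             # Get/List: all resource definitions
talosctl -n 10.5.0.2 get rd -o yaml     # full resource specs as YAML
talosctl -n 10.5.0.2 get rd --watch     # Watch: stream changes as they happen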


security/security.proto

CertificateRequest

The request message containing the certificate signing request.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| csr | bytes | | |

CertificateResponse

The response message containing the CA and the signed certificate.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| ca | bytes | | |
| crt | bytes | | |

SecurityService

The security service definition.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Certificate | CertificateRequest | CertificateResponse | |


storage/storage.proto

Disk

Disk represents a disk.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| size | uint64 | | Size indicates the disk size in bytes. |
| model | string | | Model indicates the disk model. |
| device_name | string | | DeviceName indicates the disk name (e.g. sda). |
| name | string | | Name as in /sys/block/<dev>/device/name. |
| serial | string | | Serial as in /sys/block/<dev>/device/serial. |
| modalias | string | | Modalias as in /sys/block/<dev>/device/modalias. |
| uuid | string | | Uuid as in /sys/block/<dev>/device/uuid. |
| wwid | string | | Wwid as in /sys/block/<dev>/device/wwid. |
| type | Disk.DiskType | | Type is a type of the disk: nvme, ssd, hdd, sd card. |

Disks

DisksResponse represents the response of the Disks RPC.

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| metadata | common.Metadata | | |
| disks | Disk | repeated | |

DisksResponse

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| messages | Disks | repeated | |

Disk.DiskType

| Name | Number | Description |
| ---- | ------ | ----------- |
| UNKNOWN | 0 | |
| SSD | 1 | |
| HDD | 2 | |
| NVME | 3 | |
| SD | 4 | |

StorageService

StorageService represents the storage service.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Disks | .google.protobuf.Empty | DisksResponse | |
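
The Disks RPC backs the 'talosctl disks' command; for example (the node IP is a placeholder):

talosctl -n 10.5.0.2 disks             # list disks on a configured node
talosctl disks --insecure -n 10.5.0.2  # list disks on a node still in maintenance mode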


time/time.proto

Time

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| metadata | common.Metadata | | |
| server | string | | |
| localtime | google.protobuf.Timestamp | | |
| remotetime | google.protobuf.Timestamp | | |

TimeRequest

The request message containing the ntp server

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| server | string | | |

TimeResponse

The response message containing the ntp server, time, and offset

| Field | Type | Label | Description |
| ----- | ---- | ----- | ----------- |
| messages | Time | repeated | |

TimeService

The time service definition.

| Method Name | Request Type | Response Type | Description |
| ----------- | ------------ | ------------- | ----------- |
| Time | .google.protobuf.Empty | TimeResponse | |
| TimeCheck | TimeRequest | TimeResponse | |
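
These RPCs back the 'talosctl time' command; for example (the node IP and NTP server are placeholders):

talosctl -n 10.5.0.2 time                       # Time: read the node's current time
talosctl -n 10.5.0.2 time --check pool.ntp.org  # TimeCheck: compare against an NTP server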

Scalar Value Types

| .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby |
| ----------- | ----- | --- | ---- | ------ | -- | -- | --- | ---- |
| double | | double | double | float | float64 | double | float | Float |
| float | | float | float | float | float32 | float | float | Float |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| uint32 | Uses variable-length encoding. | uint32 | int | int/long | uint32 | uint | integer | Bignum or Fixnum (as required) |
| uint64 | Uses variable-length encoding. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum or Fixnum (as required) |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 | uint | integer | Bignum or Fixnum (as required) |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | long | int/long | uint64 | ulong | integer/string | Bignum |
| sfixed32 | Always four bytes. | int32 | int | int | int32 | int | integer | Bignum or Fixnum (as required) |
| sfixed64 | Always eight bytes. | int64 | long | int/long | int64 | long | integer/string | Bignum |
| bool | | bool | boolean | boolean | bool | bool | boolean | TrueClass/FalseClass |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | String | str/unicode | string | string | string | String (UTF-8) |
| bytes | May contain any arbitrary sequence of bytes. | string | ByteString | str | []byte | ByteString | string | String (ASCII-8BIT) |

8.2 - CLI

talosctl apply-config

Apply a new configuration to a node

talosctl apply-config [flags]
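
For example, applying a freshly generated configuration to a node booted into maintenance mode might look like this (the node IP and file name are placeholders):

talosctl apply-config --insecure -n 10.5.0.2 -f controlplane.yaml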

Options

      --cert-fingerprint strings   list of server certificate fingerprints to accept (defaults to no check)
  -f, --file string                the filename of the updated configuration
  -h, --help                       help for apply-config
      --immediate                  apply the config immediately (without a reboot)
  -i, --insecure                   apply the config using the insecure (encrypted with no auth) maintenance service
      --interactive                apply the config using text based interactive mode
      --on-reboot                  apply the config on reboot

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl bootstrap

Bootstrap the etcd cluster on the specified node.

Synopsis

When a Talos cluster is created, the etcd service on each control plane node enters a join loop, waiting to join etcd peers from the other control plane nodes. One node should be picked as the bootstrap node. When the bootstrap command is issued, that node aborts the join process and bootstraps the etcd cluster as a single-node cluster. The other control plane nodes join the etcd cluster once Kubernetes is bootstrapped on the bootstrap node.

This command should not be used when “init” type nodes are used.

The Talos etcd cluster can be recovered from a known snapshot with the ‘--recover-from=’ flag.

talosctl bootstrap [flags]
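
For example, bootstrapping a new control plane node, and later recovering a cluster from an etcd snapshot, might look like this (a sketch; the node IP and snapshot path are placeholders):

# bootstrap etcd on the chosen control plane node
talosctl -n 10.5.0.2 bootstrap

# recover a cluster from a snapshot taken earlier with 'talosctl etcd snapshot'
talosctl -n 10.5.0.2 bootstrap --recover-from=./db.snapshot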

Options

  -h, --help                      help for bootstrap
      --recover-from string       recover etcd cluster from the snapshot
      --recover-skip-hash-check   skip integrity check when recovering etcd (use when recovering from data directory copy)

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl cluster create

Creates a local docker-based or QEMU-based kubernetes cluster

talosctl cluster create [flags]
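
For example (illustrative invocations; QEMU provisioning has additional host requirements not covered here):

# docker-based cluster with two worker nodes
talosctl cluster create --workers 2

# QEMU-based cluster with three control plane nodes
talosctl cluster create --provisioner qemu --masters 3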

Options

      --arch string                             cluster architecture (default "amd64")
      --bad-rtc                                 launch VM with bad RTC state (QEMU only)
      --cidr string                             CIDR of the cluster network (IPv4, ULA network for IPv6 is derived in automated way) (default "10.5.0.0/24")
      --cni-bin-path strings                    search path for CNI binaries (VM only) (default [/home/user/.talos/cni/bin])
      --cni-bundle-url string                   URL to download CNI bundle from (VM only) (default "https://github.com/siderolabs/talos/releases/download/v0.14.0-alpha.2/talosctl-cni-bundle-${ARCH}.tar.gz")
      --cni-cache-dir string                    CNI cache directory path (VM only) (default "/home/user/.talos/cni/cache")
      --cni-conf-dir string                     CNI config directory path (VM only) (default "/home/user/.talos/cni/conf.d")
      --config-patch string                     patch generated machineconfigs (applied to all node types)
      --config-patch-control-plane string       patch generated machineconfigs (applied to 'init' and 'controlplane' types)
      --config-patch-worker string              patch generated machineconfigs (applied to 'worker' type)
      --cpus string                             the share of CPUs as fraction (each container/VM) (default "2.0")
      --crashdump                               print debug crashdump to stderr when cluster startup fails
      --custom-cni-url string                   install custom CNI from the URL (Talos cluster)
      --disk int                                default limit on disk size in MB (each VM) (default 6144)
      --disk-image-path string                  disk image to use
      --dns-domain string                       the dns domain to use for cluster (default "cluster.local")
      --docker-host-ip string                   Host IP to forward exposed ports to (Docker provisioner only) (default "0.0.0.0")
      --encrypt-ephemeral                       enable ephemeral partition encryption
      --encrypt-state                           enable state partition encryption
      --endpoint string                         use endpoint instead of provider defaults
  -p, --exposed-ports string                    Comma-separated list of ports/protocols to expose on init node. Ex -p <hostPort>:<containerPort>/<protocol (tcp or udp)> (Docker provisioner only)
      --extra-boot-kernel-args string           add extra kernel args to the initial boot from vmlinuz and initramfs (QEMU only)
  -h, --help                                    help for create
      --image string                            the image to use (default "ghcr.io/talos-systems/talos:latest")
      --init-node-as-endpoint                   use init node as endpoint instead of any load balancer endpoint
      --initrd-path string                      initramfs image to use (default "_out/initramfs-${ARCH}.xz")
  -i, --input-dir string                        location of pre-generated config files
      --install-image string                    the installer image to use (default "ghcr.io/talos-systems/installer:latest")
      --ipv4                                    enable IPv4 network in the cluster (default true)
      --ipv6                                    enable IPv6 network in the cluster (QEMU provisioner only)
      --iso-path string                         the ISO path to use for the initial boot (VM only)
      --kubernetes-version string               desired kubernetes version to run (default "1.23.1")
      --masters int                             the number of masters to create (default 1)
      --memory int                              the limit on memory usage in MB (each container/VM) (default 2048)
      --mtu int                                 MTU of the cluster network (default 1500)
      --nameservers strings                     list of nameservers to use (default [8.8.8.8,1.1.1.1,2001:4860:4860::8888,2606:4700:4700::1111])
      --registry-insecure-skip-verify strings   list of registry hostnames to skip TLS verification for
      --registry-mirror strings                 list of registry mirrors to use in format: <registry host>=<mirror URL>
      --skip-injecting-config                   skip injecting config from embedded metadata server, write config files to current directory
      --skip-kubeconfig                         skip merging kubeconfig from the created cluster
      --talos-version string                    the desired Talos version to generate config for (if not set, defaults to image version)
      --use-vip                                 use a virtual IP for the controlplane endpoint instead of the loadbalancer
      --user-disk strings                       list of disks to create for each VM in format: <mount_point1>:<size1>:<mount_point2>:<size2>
      --vmlinuz-path string                     the compressed kernel image to use (default "_out/vmlinuz-${ARCH}")
      --wait                                    wait for the cluster to be ready before returning (default true)
      --wait-timeout duration                   timeout to wait for the cluster to be ready (default 20m0s)
      --wireguard-cidr string                   CIDR of the wireguard network
      --with-apply-config                       enable apply config when the VM is starting in maintenance mode
      --with-bootloader                         enable bootloader to load kernel and initramfs from disk image after install (default true)
      --with-cluster-discovery                  enable cluster discovery (default true)
      --with-debug                              enable debug in Talos config to send service logs to the console
      --with-init-node                          create the cluster with an init node
      --with-kubespan                           enable KubeSpan system
      --with-uefi                               enable UEFI on x86_64 architecture (always enabled for arm64)
      --workers int                             the number of workers to create (default 1)

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
      --name string          the name of the cluster (default "talos-default")
  -n, --nodes strings        target the specified nodes
      --provisioner string   Talos cluster provisioner to use (default "docker")
      --state string         directory path to store cluster state (default "/home/user/.talos/clusters")
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl cluster - A collection of commands for managing local docker-based or QEMU-based clusters

talosctl cluster destroy

Destroys a local docker-based or firecracker-based kubernetes cluster

talosctl cluster destroy [flags]

Options

  -h, --help   help for destroy

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
      --name string          the name of the cluster (default "talos-default")
  -n, --nodes strings        target the specified nodes
      --provisioner string   Talos cluster provisioner to use (default "docker")
      --state string         directory path to store cluster state (default "/home/user/.talos/clusters")
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl cluster - A collection of commands for managing local docker-based or QEMU-based clusters

talosctl cluster show

Shows info about a local provisioned kubernetes cluster

talosctl cluster show [flags]

Options

  -h, --help   help for show

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
      --name string          the name of the cluster (default "talos-default")
  -n, --nodes strings        target the specified nodes
      --provisioner string   Talos cluster provisioner to use (default "docker")
      --state string         directory path to store cluster state (default "/home/user/.talos/clusters")
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl cluster - A collection of commands for managing local docker-based or QEMU-based clusters

talosctl cluster

A collection of commands for managing local docker-based or QEMU-based clusters

Options

  -h, --help                 help for cluster
      --name string          the name of the cluster (default "talos-default")
      --provisioner string   Talos cluster provisioner to use (default "docker")
      --state string         directory path to store cluster state (default "/home/user/.talos/clusters")

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl completion

Output shell completion code for the specified shell (bash, fish or zsh)

Synopsis

Output shell completion code for the specified shell (bash, fish or zsh). The shell code must be evaluated to provide interactive completion of talosctl commands. This can be done by sourcing it from the .bash_profile.

Note for zsh users: [1] zsh completions are only supported in versions of zsh >= 5.2

talosctl completion SHELL [flags]

Examples

# Installing bash completion on macOS using homebrew
## If running Bash 3.2 included with macOS
	brew install bash-completion
## or, if running Bash 4.1+
	brew install bash-completion@2
## If talosctl is installed via homebrew, this should start working immediately.
## If you've installed via other means, you may need add the completion to your completion directory
	talosctl completion bash > $(brew --prefix)/etc/bash_completion.d/talosctl

# Installing bash completion on Linux
## If bash-completion is not installed on Linux, please install the 'bash-completion' package
## via your distribution's package manager.
## Load the talosctl completion code for bash into the current shell
	source <(talosctl completion bash)
## Write bash completion code to a file and source it from .bash_profile
	talosctl completion bash > ~/.talos/completion.bash.inc
	printf "
		# talosctl shell completion
		source '$HOME/.talos/completion.bash.inc'
		" >> $HOME/.bash_profile
	source $HOME/.bash_profile
# Load the talosctl completion code for fish[1] into the current shell
	talosctl completion fish | source
# Set the talosctl completion code for fish[1] to autoload on startup
    talosctl completion fish > ~/.config/fish/completions/talosctl.fish
# Load the talosctl completion code for zsh[1] into the current shell
	source <(talosctl completion zsh)
# Set the talosctl completion code for zsh[1] to autoload on startup
    talosctl completion zsh > "${fpath[1]}/_talosctl"

Options

  -h, --help   help for completion

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl config add

Add a new context

talosctl config add <context> [flags]

Options

      --ca string    the path to the CA certificate
      --crt string   the path to the certificate
  -h, --help         help for add
      --key string   the path to the key

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config context

Set the current context

talosctl config context <context> [flags]

Options

  -h, --help   help for context

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config contexts

List defined contexts

talosctl config contexts [flags]

Options

  -h, --help   help for contexts

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config endpoint

Set the endpoint(s) for the current context

talosctl config endpoint <endpoint>... [flags]

Options

  -h, --help   help for endpoint

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config info

Show information about the current context

talosctl config info [flags]

Options

  -h, --help   help for info

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config merge

Merge additional contexts from another client configuration file

Synopsis

Contexts with the same name are renamed while merging configs.

talosctl config merge <from> [flags]

Options

  -h, --help   help for merge

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config new

Generate a new client configuration file

talosctl config new [<path>] [flags]

Options

      --crt-ttl duration   certificate TTL (default 87600h0m0s)
  -h, --help               help for new
      --roles strings      roles (default [os:admin])

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config node

Set the node(s) for the current context

talosctl config node <endpoint>... [flags]

Options

  -h, --help   help for node

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl config

Manage the client configuration file (talosconfig)

Options

  -h, --help   help for config

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl conformance kubernetes

Run Kubernetes conformance tests

talosctl conformance kubernetes [flags]

Options

  -h, --help          help for kubernetes
      --mode string   conformance test mode: [fast, certified] (default "fast")

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl conformance

Run conformance tests

Options

  -h, --help   help for conformance

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl containers

List containers

talosctl containers [flags]

Options

  -h, --help         help for containers
  -k, --kubernetes   use the k8s.io containerd namespace

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl copy

Copy data out from the node

Synopsis

Creates a .tar.gz archive at the node starting at <src-path> and streams it back to the client.

If ‘-’ is given for <local-path>, the archive is written to stdout. Otherwise the archive is extracted to <local-path>, which should be an empty directory; if it doesn’t exist, talosctl creates it. The command doesn’t preserve ownership and access mode for the files in extract mode, while the streamed .tar archive captures ownership and permission bits.

talosctl copy <src-path> -|<local-path> [flags]
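
For example (the node IP and paths are placeholders):

# extract /var/log from the node into a local directory
talosctl -n 10.5.0.2 copy /var/log ./node-logs

# stream the archive to stdout and unpack it manually
talosctl -n 10.5.0.2 copy /var/log - | tar xzf -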

Options

  -h, --help   help for copy

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl dashboard

Cluster dashboard with real-time metrics

Synopsis

Provide quick UI to navigate through node real-time metrics.

Keyboard shortcuts:

  • h, <Left>: switch one node to the left
  • l, <Right>: switch one node to the right
  • j, <Down>: scroll process list down
  • k, <Up>: scroll process list up
  • <C-d>: scroll process list half page down
  • <C-u>: scroll process list half page up
  • <C-f>: scroll process list one page down
  • <C-b>: scroll process list one page up

talosctl dashboard [flags]

Options

  -h, --help                       help for dashboard
  -d, --update-interval duration   interval between updates (default 3s)

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl disks

Get the list of disks from /sys/block on the machine

talosctl disks [flags]

Options

  -h, --help       help for disks
  -i, --insecure   get disks using the insecure (encrypted with no auth) maintenance service

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl dmesg

Retrieve kernel logs

talosctl dmesg [flags]

Options

  -f, --follow   specify if the kernel log should be streamed
  -h, --help     help for dmesg
      --tail     specify if only new messages should be sent (makes sense only when combined with --follow)

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl edit

Edit a resource from the default editor.

Synopsis

The edit command allows you to directly edit any API resource you can retrieve via the command line tools.

It will open the editor defined by your TALOS_EDITOR, or EDITOR environment variables, or fall back to ‘vi’ for Linux or ’notepad’ for Windows.

talosctl edit <type> [<id>] [flags]
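
For example, editing the machine configuration resource on a node and applying the change without a reboot might look like this (a sketch; the node IP is a placeholder and assumes the machineconfig resource type):

talosctl -n 10.5.0.2 edit machineconfig --immediate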

Options

  -h, --help               help for edit
      --immediate          apply the change immediately (without a reboot)
      --namespace string   resource namespace (default is to use default namespace per resource)
      --on-reboot          apply the change on next reboot

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl etcd forfeit-leadership

Tell node to forfeit etcd cluster leadership

talosctl etcd forfeit-leadership [flags]

Options

  -h, --help   help for forfeit-leadership

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl etcd leave

Tell nodes to leave etcd cluster

talosctl etcd leave [flags]

Options

  -h, --help   help for leave

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl etcd members

Get the list of etcd cluster members

talosctl etcd members [flags]

Options

  -h, --help   help for members

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl etcd remove-member

Remove the node from etcd cluster

Synopsis

Use this command only if you want to remove a member which is in a broken state, for example when there is no access to the node or the node can’t access etcd to perform ‘etcd leave’. Always prefer ‘etcd leave’ over this command.

talosctl etcd remove-member <hostname> [flags]

Options

  -h, --help   help for remove-member

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl etcd snapshot

Stream snapshot of the etcd node to the path.

talosctl etcd snapshot <path> [flags]

Options

  -h, --help   help for snapshot

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl etcd

Manage etcd

Options

  -h, --help   help for etcd

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl events

Stream runtime events

talosctl events [flags]

Options

      --duration duration   show events for the past duration interval (one second resolution, default is to show no history)
  -h, --help                help for events
      --since string        show events after the specified event ID (default is to show no history)
      --tail int32          show specified number of past events (use -1 to show full history, default is to show no history)

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl gen ca

Generates a self-signed X.509 certificate authority

talosctl gen ca [flags]

Options

  -h, --help                  help for ca
      --hours int             the hours from now on which the certificate validity period ends (default 87600)
      --organization string   X.509 distinguished name for the Organization
      --rsa                   generate in RSA format

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl gen config

Generates a set of configuration files for Talos cluster

Synopsis

The cluster endpoint is the URL for the Kubernetes API. If you decide to use a control plane node, common in a single node control plane setup, use port 6443 as this is the port that the API server binds to on every control plane node. For an HA setup, usually involving a load balancer, use the IP and port of the load balancer.

talosctl gen config <cluster name> <cluster endpoint> [flags]
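
For example, generating configuration for an HA cluster whose Kubernetes API sits behind a load balancer might look like this (the cluster name and endpoint are placeholders):

talosctl gen config my-cluster https://talos-lb.example.com:6443 --output-dir _out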

Options

      --additional-sans strings             additional Subject-Alt-Names for the APIServer certificate
      --config-patch string                 patch generated machineconfigs (applied to all node types)
      --config-patch-control-plane string   patch generated machineconfigs (applied to 'init' and 'controlplane' types)
      --config-patch-worker string          patch generated machineconfigs (applied to 'worker' type)
      --dns-domain string                   the dns domain to use for cluster (default "cluster.local")
  -h, --help                                help for config
      --install-disk string                 the disk to install to (default "/dev/sda")
      --install-image string                the image used to perform an installation (default "ghcr.io/talos-systems/installer:latest")
      --kubernetes-version string           desired kubernetes version to run
  -o, --output-dir string                   destination to output generated files
  -p, --persist                             the desired persist value for configs (default true)
      --registry-mirror strings             list of registry mirrors to use in format: <registry host>=<mirror URL>
      --talos-version string                the desired Talos version to generate config for (backwards compatibility, e.g. v0.8)
      --version string                      the desired machine config version to generate (default "v1alpha1")
      --with-cluster-discovery              enable cluster discovery feature (default true)
      --with-docs                           renders all machine configs adding the documentation for each field (default true)
      --with-examples                       renders all machine configs with the commented examples (default true)
      --with-kubespan                       enable KubeSpan feature

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl gen crt

Generates an X.509 Ed25519 certificate

talosctl gen crt [flags]

Options

      --ca string     path to the PEM encoded CERTIFICATE
      --csr string    path to the PEM encoded CERTIFICATE REQUEST
  -h, --help          help for crt
      --hours int     the hours from now on which the certificate validity period ends (default 24)
      --name string   the basename of the generated file

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl gen csr

Generates a CSR using an Ed25519 private key

talosctl gen csr [flags]

Options

  -h, --help            help for csr
      --ip string       generate the certificate for this IP address
      --key string      path to the PEM encoded EC or RSA PRIVATE KEY
      --roles strings   roles (default [os:admin])

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl gen key

Generates an Ed25519 private key

talosctl gen key [flags]

Options

  -h, --help          help for key
      --name string   the basename of the generated file

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl gen keypair

Generates an X.509 Ed25519 key pair

talosctl gen keypair [flags]

Options

  -h, --help                  help for keypair
      --ip string             generate the certificate for this IP address
      --organization string   X.509 distinguished name for the Organization

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl gen

Generate CAs, certificates, and private keys

Options

  -h, --help   help for gen

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl get

Get a specific resource or list of resources.

Synopsis

Similar to ‘kubectl get’, ’talosctl get’ returns a set of resources from the OS. To get a list of all available resource definitions, issue ’talosctl get rd’

talosctl get <type> [<id>] [flags]

Options

  -h, --help               help for get
  -i, --insecure           get resources using the insecure (encrypted with no auth) maintenance service
      --namespace string   resource namespace (default is to use default namespace per resource)
  -o, --output string      output mode (json, table, yaml) (default "table")
  -w, --watch              watch resource changes

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl health

Check cluster health

talosctl health [flags]

Options

      --control-plane-nodes strings   specify IPs of control plane nodes
  -h, --help                          help for health
      --init-node string              specify IPs of init node
      --k8s-endpoint string           use endpoint instead of kubeconfig default
      --run-e2e                       run Kubernetes e2e test
      --server                        run server-side check (default true)
      --wait-timeout duration         timeout to wait for the cluster to be ready (default 20m0s)
      --worker-nodes strings          specify IPs of worker nodes

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl images

List the default images used by Talos

talosctl images [flags]

Options

  -h, --help   help for images

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl inspect dependencies

Inspect controller-resource dependencies as graphviz graph.

Synopsis

Inspect controller-resource dependencies as graphviz graph.

Pipe the output of the command through the “dot” program (part of graphviz package) to render the graph:

talosctl inspect dependencies | dot -Tpng > graph.png

talosctl inspect dependencies [flags]

Options

  -h, --help             help for dependencies
      --with-resources   display live resource information with dependencies

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl inspect

Inspect internals of Talos

Options

  -h, --help   help for inspect

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

talosctl kubeconfig

Download the admin kubeconfig from the node

Synopsis

Download the admin kubeconfig from the node. If merge flag is defined, config will be merged with ~/.kube/config or [local-path] if specified. Otherwise kubeconfig will be written to PWD or [local-path] if specified.

talosctl kubeconfig [local-path] [flags]
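
For example (the node IP is a placeholder):

# merge the cluster's kubeconfig into ~/.kube/config (the default behaviour)
talosctl -n 10.5.0.2 kubeconfig

# write it to a separate file instead of merging
talosctl -n 10.5.0.2 kubeconfig ./kubeconfig --merge=false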

Options

  -f, --force                       Force overwrite of kubeconfig if already present, force overwrite on kubeconfig merge
      --force-context-name string   Force context name for kubeconfig merge
  -h, --help                        help for kubeconfig
  -m, --merge                       Merge with existing kubeconfig (default true)

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl list

Retrieve a directory listing

talosctl list [path] [flags]

Options

  -d, --depth int32    maximum recursion depth
  -h, --help           help for list
  -H, --humanize       humanize size and time in the output
  -l, --long           display additional file details
  -r, --recurse        recurse into subdirectories
  -t, --type strings   filter by specified types:
                       f	regular file
                       d	directory
                       l, L	symbolic link

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl logs

Retrieve logs for a service

talosctl logs <service name> [flags]

Options

  -f, --follow       specify if the logs should be streamed
  -h, --help         help for logs
  -k, --kubernetes   use the k8s.io containerd namespace
      --tail int32   lines of log file to display (default is to show from the beginning) (default -1)

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl memory

Show memory usage

talosctl memory [flags]

Options

  -h, --help      help for memory
  -v, --verbose   display extended memory statistics

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl mounts

List mounts

talosctl mounts [flags]

Options

  -h, --help   help for mounts

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl patch

Update field(s) of a resource using a JSON patch.

talosctl patch <type> [<id>] [flags]
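
For example, patching the machine configuration with a JSON patch might look like this (a sketch; the node IP, target path, and image tag are illustrative, and the machineconfig resource type is assumed):

talosctl -n 10.5.0.2 patch machineconfig --immediate --patch '[{"op": "replace", "path": "/machine/install/image", "value": "ghcr.io/talos-systems/installer:v0.14.0"}]'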

Options

  -h, --help                help for patch
      --immediate           apply the change immediately (without a reboot)
      --namespace string    resource namespace (default is to use default namespace per resource)
      --on-reboot           apply the change on next reboot
  -p, --patch string        the patch to be applied to the resource file.
      --patch-file string   a file containing a patch to be applied to the resource.

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl processes

List running processes

talosctl processes [flags]

Options

  -h, --help          help for processes
  -s, --sort string   Column to sort output by. [rss|cpu] (default "rss")
  -w, --watch         Stream running processes

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl read

Read a file on the machine

talosctl read <path> [flags]

Options

  -h, --help   help for read

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl reboot

Reboot a node

talosctl reboot [flags]

Options

  -h, --help          help for reboot
  -m, --mode string   select the reboot mode: "default", "powercycle" (skips kexec) (default "default")

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl reset

Reset a node

talosctl reset [flags]

Options

      --graceful                        if true, attempt to cordon/drain node and leave etcd (if applicable) (default true)
  -h, --help                            help for reset
      --reboot                          if true, reboot the node after resetting instead of shutting down
      --system-labels-to-wipe strings   if set, just wipe selected system disk partitions by label but keep other partitions intact

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl restart

Restart a process

talosctl restart <id> [flags]

Options

  -h, --help         help for restart
  -k, --kubernetes   use the k8s.io containerd namespace

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl rollback

Rollback a node to the previous installation

talosctl rollback [flags]

Options

  -h, --help   help for rollback

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl service

Retrieve the state of a service (or all services), control service state

Synopsis

Service control command. If run without arguments, it lists all the services and their state. If a service ID is specified, the default action ‘status’ is executed, which shows the status of a single service. With actions ‘start’, ‘stop’, and ‘restart’, the service state is updated accordingly.

talosctl service [<id> [start|stop|restart|status]] [flags]
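
For example (the node IP is a placeholder; kubelet is one of the standard Talos services):

talosctl -n 10.5.0.2 service                  # list all services and their state
talosctl -n 10.5.0.2 service kubelet          # status of a single service
talosctl -n 10.5.0.2 service kubelet restart  # restart it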

Options

  -h, --help   help for service

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl shutdown

Shutdown a node

talosctl shutdown [flags]

Options

  -h, --help   help for shutdown

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl stats

Get container stats

talosctl stats [flags]

Options

  -h, --help         help for stats
  -k, --kubernetes   use the k8s.io containerd namespace

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl support

Dump debug information about the cluster

Synopsis

Generated bundle contains the following debug information:

  • For each node:

    • Kernel logs.
    • All Talos internal services logs.
    • All kube-system pods logs.
    • Talos COSI resources without secrets.
    • COSI runtime state graph.
    • Processes snapshot.
    • IO pressure snapshot.
    • Mounts list.
    • PCI devices info.
    • Talos version.
  • For the cluster:

    • Kubernetes nodes and kube-system pods manifests.

talosctl support [flags]

Options

  -h, --help              help for support
  -w, --num-workers int   number of workers per node (default 1)
  -O, --output string     output file to write support archive to
  -v, --verbose           verbose output
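
For example, to collect a support bundle from a node into a local archive (the node address and output file name are illustrative):

talosctl -n 10.0.0.5 support -O support.zip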

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl time

Gets current server time

talosctl time [--check server] [flags]

Options

  -c, --check string   checks server time against specified ntp server
  -h, --help           help for time
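
For example, to compare a node’s time against an NTP server (the node address is illustrative):

talosctl -n 10.0.0.5 time --check time.cloudflare.com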

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl upgrade

Upgrade Talos on the target node

talosctl upgrade [flags]

Options

  -f, --force          force the upgrade (skip checks on etcd health and members, might lead to data loss)
  -h, --help           help for upgrade
  -i, --image string   the container image to use for performing the install
  -p, --preserve       preserve data
  -s, --stage          stage the upgrade to perform it after a reboot
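
A typical invocation points at an installer image for the target release; the node address is illustrative, and the image tag should be replaced with the desired Talos version rather than latest:

talosctl -n 10.0.0.5 upgrade --image ghcr.io/talos-systems/installer:latest --preserve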

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl upgrade-k8s

Upgrade Kubernetes control plane in the Talos cluster.

Synopsis

Command runs upgrade of Kubernetes control plane components between specified versions.

talosctl upgrade-k8s [flags]

Options

      --dry-run           skip the actual upgrade and show the upgrade plan instead
      --endpoint string   the cluster control plane endpoint
      --from string       the Kubernetes control plane version to upgrade from
  -h, --help              help for upgrade-k8s
      --to string         the Kubernetes control plane version to upgrade to (default "1.23.1")
      --upgrade-kubelet   upgrade kubelet service (default true)
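
For example, to preview the upgrade plan and then perform an upgrade to the default target version (the node address is illustrative):

talosctl -n 10.0.0.2 upgrade-k8s --to 1.23.1 --dry-run
talosctl -n 10.0.0.2 upgrade-k8s --to 1.23.1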

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl usage

Retrieve disk usage

talosctl usage [path1] [path2] ... [pathN] [flags]

Options

  -a, --all             write counts for all files, not just directories
  -d, --depth int32     maximum recursion depth
  -h, --help            help for usage
  -H, --humanize        humanize size and time in the output
  -t, --threshold int   threshold exclude entries smaller than SIZE if positive, or entries greater than SIZE if negative
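
For example, to show human-readable sizes under /var, two levels deep (the node address is illustrative):

talosctl -n 10.0.0.5 usage -H -d 2 /var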

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl validate

Validate config

talosctl validate [flags]

Options

  -c, --config string   the path of the config file
  -h, --help            help for validate
  -m, --mode string     the mode to validate the config for (valid values are metal, cloud, and container)
      --strict          treat validation warnings as errors
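
For example, to validate a generated machine config for a bare metal node (the file name is illustrative):

talosctl validate --config controlplane.yaml --mode metal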

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl version

Prints the version

talosctl version [flags]

Options

      --client   Print client version only
  -h, --help     help for version
      --short    Print the short version

Options inherited from parent commands

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

  • talosctl - A CLI for out-of-band management of Kubernetes nodes created by Talos

talosctl

A CLI for out-of-band management of Kubernetes nodes created by Talos

Options

      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -h, --help                 help for talosctl
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "/home/user/.talos/config")

SEE ALSO

8.3 - Configuration

The v1alpha1 configuration file contains all the options available for configuring a machine.

To generate a set of basic configuration files, run:

talosctl gen config --version v1alpha1 <cluster name> <cluster endpoint>

This will generate a machine config for each node type, and a talosconfig for the CLI.
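
For example, with a placeholder cluster name and control plane endpoint:

talosctl gen config --version v1alpha1 my-cluster https://10.0.0.10:6443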

Config

Config defines the v1alpha1 configuration file.

version: v1alpha1
persist: true
machine: # ...
cluster: # ...

version string

Indicates the schema used to decode the contents.

Valid values:

  • v1alpha1

debug bool

Enable verbose logging to the console. All system container logs will flow to the serial console.

Note: To avoid breaking the Talos bootstrap flow, enable this option only if the serial console can handle a high message throughput.

Valid values:

  • true

  • yes

  • false

  • no


persist bool

Indicates whether to pull the machine config upon every boot.

Valid values:

  • true

  • yes

  • false

  • no


machine MachineConfig

Provides machine specific configuration options.


cluster ClusterConfig

Provides cluster specific configuration options.


MachineConfig

MachineConfig represents the machine-specific config values.

Appears in:

type: controlplane
# InstallConfig represents the installation options for preparing a node.
install:
    disk: /dev/sda # The disk used for installations.
    # Allows for supplying extra kernel args via the bootloader.
    extraKernelArgs:
        - console=ttyS1
        - panic=10
    image: ghcr.io/talos-systems/installer:latest # Allows for supplying the image used to perform the installation.
    bootloader: true # Indicates if a bootloader should be installed.
    wipe: false # Indicates if the installation disk should be wiped at installation time.

    # # Look up disk using disk attributes like model, size, serial and others.
    # diskSelector:
    #     size: 4GB # Disk size.
    #     model: WDC* # Disk model `/sys/block/<dev>/device/model`.

type string

Defines the role of the machine within the cluster.

Init

Init node type designates the first control plane node to come up. You can think of it like a bootstrap node. This node will perform the initial steps to bootstrap the cluster – generation of TLS assets, starting of the control plane, etc.

Control Plane

Control Plane node type designates the node as a control plane member. This means it will host etcd along with the Kubernetes master components such as API Server, Controller Manager, Scheduler.

Worker

Worker node type designates the node as a worker node. This means it will be an available compute node for scheduling workloads.

This node type was previously known as “join”; that value is still supported but deprecated.

Valid values:

  • init

  • controlplane

  • worker


token string

The token is used by a machine to join the PKI of the cluster. Using this token, a machine will create a certificate signing request (CSR) and request a certificate that will be used as its identity.

Warning: It is important to ensure that this token is correct since a machine’s certificate has a short TTL by default.

Examples:

token: 328hom.uqjzh6jnn2eie9oi

ca PEMEncodedCertificateAndKey

The root certificate authority of the PKI. It is composed of a base64 encoded crt and key.

Examples:

ca:
    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

certSANs []string

Extra certificate subject alternative names for the machine’s certificate. By default, all non-loopback interface IPs are automatically added to the certificate’s SANs.

Examples:

certSANs:
    - 10.0.0.10
    - 172.16.0.10
    - 192.168.0.10

controlPlane MachineControlPlaneConfig

Provides machine specific control plane configuration options.

Examples:

controlPlane:
    # Controller manager machine specific configuration options.
    controllerManager:
        disabled: false # Disable kube-controller-manager on the node.
    # Scheduler machine specific configuration options.
    scheduler:
        disabled: true # Disable kube-scheduler on the node.

kubelet KubeletConfig

Used to provide additional options to the kubelet.

Examples:

kubelet:
    image: ghcr.io/talos-systems/kubelet:v1.23.1 # The `image` field is an optional reference to an alternative kubelet image.
    # The `extraArgs` field is used to provide additional flags to the kubelet.
    extraArgs:
        feature-gates: ServerSideApply=true

    # # The `ClusterDNS` field is an optional reference to an alternative kubelet clusterDNS ip list.
    # clusterDNS:
    #     - 10.96.0.10
    #     - 169.254.2.53

    # # The `extraMounts` field is used to add additional mounts to the kubelet container.
    # extraMounts:
    #     - destination: /var/lib/example
    #       type: bind
    #       source: /var/lib/example
    #       options:
    #         - bind
    #         - rshared
    #         - rw

    # # The `nodeIP` field is used to configure `--node-ip` flag for the kubelet.
    # nodeIP:
    #     # The `validSubnets` field configures the networks to pick kubelet node IP from.
    #     validSubnets:
    #         - 10.0.0.0/8
    #         - '!10.0.0.3/32'
    #         - fdc7::/16

network NetworkConfig

Provides machine specific network configuration options.

Examples:

network:
    hostname: worker-1 # Used to statically set the hostname for the machine.
    # `interfaces` is used to define the network interface configuration.
    interfaces:
        - interface: eth0 # The interface name.
          # Assigns static IP addresses to the interface.
          addresses:
            - 192.168.2.0/24
          # A list of routes associated with the interface.
          routes:
            - network: 0.0.0.0/0 # The route's network.
              gateway: 192.168.2.1 # The route's gateway.
              metric: 1024 # The optional metric for the route.
          mtu: 1500 # The interface's MTU.

          # # Bond specific options.
          # bond:
          #     # The interfaces that make up the bond.
          #     interfaces:
          #         - eth0
          #         - eth1
          #     mode: 802.3ad # A bond option.
          #     lacpRate: fast # A bond option.

          # # Indicates if DHCP should be used to configure the interface.
          # dhcp: true

          # # DHCP specific options.
          # dhcpOptions:
          #     routeMetric: 1024 # The priority of all routes received via DHCP.

          # # Wireguard specific configuration.

          # # wireguard server example
          # wireguard:
          #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
          #     listenPort: 51111 # Specifies a device's listening port.
          #     # Specifies a list of peer configurations to apply to a device.
          #     peers:
          #         - publicKey: ABCDEF... # Specifies the public key of this peer.
          #           endpoint: 192.168.1.3 # Specifies the endpoint of this peer entry.
          #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
          #           allowedIPs:
          #             - 192.168.1.0/24
          # # wireguard peer example
          # wireguard:
          #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
          #     # Specifies a list of peer configurations to apply to a device.
          #     peers:
          #         - publicKey: ABCDEF... # Specifies the public key of this peer.
          #           endpoint: 192.168.1.2 # Specifies the endpoint of this peer entry.
          #           persistentKeepaliveInterval: 10s # Specifies the persistent keepalive interval for this peer.
          #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
          #           allowedIPs:
          #             - 192.168.1.0/24

          # # Virtual (shared) IP address configuration.
          # vip:
          #     ip: 172.16.199.55 # Specifies the IP address to be used.
    # Used to statically set the nameservers for the machine.
    nameservers:
        - 9.8.7.6
        - 8.7.6.5

    # # Allows for extra entries to be added to the `/etc/hosts` file
    # extraHostEntries:
    #     - ip: 192.168.1.100 # The IP of the host.
    #       # The host alias.
    #       aliases:
    #         - example
    #         - example.domain.tld

    # # Configures KubeSpan feature.
    # kubespan:
    #     enabled: true # Enable the KubeSpan feature.

disks []MachineDisk

Used to partition, format and mount additional disks. Since the rootfs is read-only with the exception of /var, mounts are only valid if they are under /var. Note that the partitioning and formatting is done only once, if and only if no existing partitions are found. If size: is omitted, the partition is sized to occupy the full disk.

Note: size is in units of bytes.

Examples:

disks:
    - device: /dev/sdb # The name of the disk to use.
      # A list of partitions to create on the disk.
      partitions:
        - mountpoint: /var/mnt/extra # Where to mount the partition.

          # # The size of partition: either bytes or human readable representation. If `size:` is omitted, the partition is sized to occupy the full disk.

          # # Human readable representation.
          # size: 100 MB
          # # Precise value in bytes.
          # size: 1073741824

install InstallConfig

Used to provide instructions for installations.

Examples:

install:
    disk: /dev/sda # The disk used for installations.
    # Allows for supplying extra kernel args via the bootloader.
    extraKernelArgs:
        - console=ttyS1
        - panic=10
    image: ghcr.io/talos-systems/installer:latest # Allows for supplying the image used to perform the installation.
    bootloader: true # Indicates if a bootloader should be installed.
    wipe: false # Indicates if the installation disk should be wiped at installation time.

    # # Look up disk using disk attributes like model, size, serial and others.
    # diskSelector:
    #     size: 4GB # Disk size.
    #     model: WDC* # Disk model `/sys/block/<dev>/device/model`.

files []MachineFile

Allows the addition of user specified files. The value of op can be create, overwrite, or append. In the case of create, path must not exist. In the case of overwrite and append, path must be a valid file. If an op value of append is used, the existing file will be appended to. Note that the file contents are not required to be base64 encoded.

Note: The specified path is relative to /var.

Examples:

files:
    - content: '...' # The contents of the file.
      permissions: 0o666 # The file's permissions in octal.
      path: /tmp/file.txt # The path of the file.
      op: append # The operation to use

env Env

The env field allows for the addition of environment variables. All environment variables are set on PID 1 in addition to every service.

Valid values:

  • GRPC_GO_LOG_VERBOSITY_LEVEL

  • GRPC_GO_LOG_SEVERITY_LEVEL

  • http_proxy

  • https_proxy

  • no_proxy

Examples:

env:
    GRPC_GO_LOG_SEVERITY_LEVEL: info
    GRPC_GO_LOG_VERBOSITY_LEVEL: "99"
    https_proxy: http://SERVER:PORT/
env:
    GRPC_GO_LOG_SEVERITY_LEVEL: error
    https_proxy: https://USERNAME:PASSWORD@SERVER:PORT/
env:
    https_proxy: http://DOMAIN\USERNAME:PASSWORD@SERVER:PORT/

time TimeConfig

Used to configure the machine’s time settings.

Examples:

time:
    disabled: false # Indicates if the time service is disabled for the machine.
    # Specifies time (NTP) servers to use for setting the system time.
    servers:
        - time.cloudflare.com
    bootTimeout: 2m0s # Specifies the timeout when the node time is considered to be in sync unlocking the boot sequence.

sysctls map[string]string

Used to configure the machine’s sysctls.

Examples:

sysctls:
    kernel.domainname: talos.dev
    net.ipv4.ip_forward: "0"

registries RegistriesConfig

Used to configure the machine’s container image registry mirrors.

Automatically generates matching CRI configuration for registry mirrors.

The mirrors section allows redirecting requests for images to a non-default registry, which might be a local registry or a caching mirror.

The config section provides a way to authenticate to the registry with a TLS client identity, provide a registry CA, or supply authentication information. Authentication information has the same meaning as the corresponding field in .docker/config.json.

See also matching configuration for CRI containerd plugin.

Examples:

registries:
    # Specifies mirror configuration for each registry.
    mirrors:
        docker.io:
            # List of endpoints (URLs) for registry mirrors to use.
            endpoints:
                - https://registry.local
    # Specifies TLS & auth configuration for HTTPS image registries.
    config:
        registry.local:
            # The TLS configuration for the registry.
            tls:
                # Enable mutual TLS authentication with the registry.
                clientIdentity:
                    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
                    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u
            # The auth configuration for this registry.
            auth:
                username: username # Optional registry authentication.
                password: password # Optional registry authentication.

systemDiskEncryption SystemDiskEncryptionConfig

Machine system disk encryption configuration. Defines the encryption parameters for each system partition.

Examples:

systemDiskEncryption:
    # Ephemeral partition encryption.
    ephemeral:
        provider: luks2 # Encryption provider to use for the encryption.
        # Defines the encryption keys generation and storage method.
        keys:
            - # Deterministically generated key from the node UUID and PartitionLabel.
              nodeID: {}
              slot: 0 # Key slot number for LUKS2 encryption.

        # # Cipher kind to use for the encryption. Depends on the encryption provider.
        # cipher: aes-xts-plain64

        # # Defines the encryption sector size.
        # blockSize: 4096

        # # Additional --perf parameters for the LUKS2 encryption.
        # options:
        #     - no_read_workqueue
        #     - no_write_workqueue

Features describe individual Talos features that can be switched on or off.

Examples:

features:
    rbac: true # Enable role-based access control (RBAC).

Configures the udev system.

Examples:

udev:
    # List of udev rules to apply to the udev system
    rules:
        - SUBSYSTEM=="drm", KERNEL=="renderD*", GROUP="44", MODE="0660"

Configures the logging system.

Examples:

logging:
    # Logging destination.
    destinations:
        - endpoint: tcp://1.2.3.4:12345 # Where to send logs. Supported protocols are "tcp" and "udp".
          format: json_lines # Logs format.

ClusterConfig

ClusterConfig represents the cluster-wide config values.

Appears in:

# ControlPlaneConfig represents the control plane configuration options.
controlPlane:
    endpoint: https://1.2.3.4 # Endpoint is the canonical controlplane endpoint, which can be an IP address or a DNS hostname.
    localAPIServerPort: 443 # The port that the API server listens on internally.
clusterName: talos.local
# ClusterNetworkConfig represents kube networking configuration options.
network:
    # The CNI used.
    cni:
        name: flannel # Name of CNI to use.
    dnsDomain: cluster.local # The domain used by Kubernetes DNS.
    # The pod subnet CIDR.
    podSubnets:
        - 10.244.0.0/16
    # The service subnet CIDR.
    serviceSubnets:
        - 10.96.0.0/12

id string

Globally unique identifier for this cluster (base64 encoded random 32 bytes).


secret string

Shared secret of cluster (base64 encoded random 32 bytes). This secret is shared among cluster members but should never be sent over the network.


controlPlane ControlPlaneConfig

Provides control plane specific configuration options.

Examples:

controlPlane:
    endpoint: https://1.2.3.4 # Endpoint is the canonical controlplane endpoint, which can be an IP address or a DNS hostname.
    localAPIServerPort: 443 # The port that the API server listens on internally.

clusterName string

Configures the cluster’s name.


network ClusterNetworkConfig

Provides cluster specific network configuration options.

Examples:

network:
    # The CNI used.
    cni:
        name: flannel # Name of CNI to use.
    dnsDomain: cluster.local # The domain used by Kubernetes DNS.
    # The pod subnet CIDR.
    podSubnets:
        - 10.244.0.0/16
    # The service subnet CIDR.
    serviceSubnets:
        - 10.96.0.0/12

token string

The bootstrap token used to join the cluster.

Examples:

token: wlzjyw.bei2zfylhs2by0wd

aescbcEncryptionSecret string

The key used for the encryption of secret data at rest.

Examples:

aescbcEncryptionSecret: z01mye6j16bspJYtTB/5SFX8j7Ph4JXxM2Xuu4vsBPM=

ca PEMEncodedCertificateAndKey

The base64 encoded root certificate authority used by Kubernetes.

Examples:

ca:
    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

aggregatorCA PEMEncodedCertificateAndKey

The base64 encoded aggregator certificate authority used by Kubernetes for front-proxy certificate generation.

This CA can be self-signed.

Examples:

aggregatorCA:
    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

serviceAccount PEMEncodedKey

The base64 encoded private key for service account token generation.

Examples:

serviceAccount:
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

apiServer APIServerConfig

API server specific configuration options.

Examples:

apiServer:
    image: k8s.gcr.io/kube-apiserver:v1.23.1 # The container image used in the API server manifest.
    # Extra arguments to supply to the API server.
    extraArgs:
        feature-gates: ServerSideApply=true
        http2-max-streams-per-connection: "32"
    # Extra certificate subject alternative names for the API server's certificate.
    certSANs:
        - 1.2.3.4
        - 4.5.6.7

controllerManager ControllerManagerConfig

Controller manager server specific configuration options.

Examples:

controllerManager:
    image: k8s.gcr.io/kube-controller-manager:v1.23.1 # The container image used in the controller manager manifest.
    # Extra arguments to supply to the controller manager.
    extraArgs:
        feature-gates: ServerSideApply=true

proxy ProxyConfig

Kube-proxy server-specific configuration options.

Examples:

proxy:
    image: k8s.gcr.io/kube-proxy:v1.23.1 # The container image used in the kube-proxy manifest.
    mode: ipvs # proxy mode of kube-proxy.
    # Extra arguments to supply to kube-proxy.
    extraArgs:
        proxy-mode: iptables

scheduler SchedulerConfig

Scheduler server specific configuration options.

Examples:

scheduler:
    image: k8s.gcr.io/kube-scheduler:v1.23.1 # The container image used in the scheduler manifest.
    # Extra arguments to supply to the scheduler.
    extraArgs:
        feature-gates: AllBeta=true

Configures cluster member discovery.

Examples:

discovery:
    enabled: true # Enable the cluster membership discovery feature.
    # Configure registries used for cluster member discovery.
    registries:
        # Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information
        kubernetes: {}
        # Service registry is using an external service to push and pull information about cluster members.
        service:
            endpoint: https://discovery.talos.dev/ # External service endpoint.

etcd EtcdConfig

Etcd specific configuration options.

Examples:

etcd:
    image: gcr.io/etcd-development/etcd:v3.5.1 # The container image used to create the etcd service.
    # The `ca` is the root certificate authority of the PKI.
    ca:
        crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
        key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u
    # Extra arguments to supply to etcd.
    extraArgs:
        election-timeout: "5000"

    # # The subnet from which the advertise URL should be.
    # subnet: 10.0.0.0/8

coreDNS CoreDNS

Core DNS specific configuration options.

Examples:

coreDNS:
    image: docker.io/coredns/coredns:1.8.6 # The `image` field is an override to the default coredns image.

externalCloudProvider ExternalCloudProviderConfig

External cloud provider configuration.

Examples:

externalCloudProvider:
    enabled: true # Enable external cloud provider.
    # A list of urls that point to additional manifests for an external cloud provider.
    manifests:
        - https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/rbac.yaml
        - https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/aws-cloud-controller-manager-daemonset.yaml

extraManifests []string

A list of urls that point to additional manifests. These will get automatically deployed as part of the bootstrap.

Examples:

extraManifests:
    - https://www.example.com/manifest1.yaml
    - https://www.example.com/manifest2.yaml

extraManifestHeaders map[string]string

A map of key value pairs that will be added while fetching the extraManifests.

Examples:

extraManifestHeaders:
    Token: "1234567"
    X-ExtraInfo: info

inlineManifests ClusterInlineManifests

A list of inline Kubernetes manifests. These will get automatically deployed as part of the bootstrap.

Examples:

inlineManifests:
    - name: namespace-ci # Name of the manifest.
      contents: |- # Manifest contents as a string.
        apiVersion: v1
        kind: Namespace
        metadata:
        	name: ci

adminKubeconfig AdminKubeconfigConfig

Settings for admin kubeconfig generation. Certificate lifetime can be configured.

Examples:

adminKubeconfig:
    certLifetime: 1h0m0s # Admin kubeconfig certificate lifetime (default is 1 year).

allowSchedulingOnMasters bool

Allows running workload on master nodes.

Valid values:

  • true

  • yes

  • false

  • no


ExtraMount

ExtraMount wraps OCI Mount specification.

Appears in:

- destination: /var/lib/example
  type: bind
  source: /var/lib/example
  options:
    - bind
    - rshared
    - rw

MachineControlPlaneConfig

MachineControlPlaneConfig machine specific configuration options.

Appears in:

# Controller manager machine specific configuration options.
controllerManager:
    disabled: false # Disable kube-controller-manager on the node.
# Scheduler machine specific configuration options.
scheduler:
    disabled: true # Disable kube-scheduler on the node.

controllerManager MachineControllerManagerConfig

Controller manager machine specific configuration options.


scheduler MachineSchedulerConfig

Scheduler machine specific configuration options.


MachineControllerManagerConfig

MachineControllerManagerConfig represents the machine specific ControllerManager config values.

Appears in:


disabled bool

Disable kube-controller-manager on the node.


MachineSchedulerConfig

MachineSchedulerConfig represents the machine specific Scheduler config values.

Appears in:


disabled bool

Disable kube-scheduler on the node.


KubeletConfig

KubeletConfig represents the kubelet config values.

Appears in:

image: ghcr.io/talos-systems/kubelet:v1.23.1 # The `image` field is an optional reference to an alternative kubelet image.
# The `extraArgs` field is used to provide additional flags to the kubelet.
extraArgs:
    feature-gates: ServerSideApply=true

# # The `ClusterDNS` field is an optional reference to an alternative kubelet clusterDNS ip list.
# clusterDNS:
#     - 10.96.0.10
#     - 169.254.2.53

# # The `extraMounts` field is used to add additional mounts to the kubelet container.
# extraMounts:
#     - destination: /var/lib/example
#       type: bind
#       source: /var/lib/example
#       options:
#         - bind
#         - rshared
#         - rw

# # The `nodeIP` field is used to configure `--node-ip` flag for the kubelet.
# nodeIP:
#     # The `validSubnets` field configures the networks to pick kubelet node IP from.
#     validSubnets:
#         - 10.0.0.0/8
#         - '!10.0.0.3/32'
#         - fdc7::/16

image string

The image field is an optional reference to an alternative kubelet image.

Examples:

image: ghcr.io/talos-systems/kubelet:v1.23.1

clusterDNS []string

The ClusterDNS field is an optional reference to an alternative kubelet clusterDNS ip list.

Examples:

clusterDNS:
    - 10.96.0.10
    - 169.254.2.53

extraArgs map[string]string

The extraArgs field is used to provide additional flags to the kubelet.

Examples:

extraArgs:
    key: value

extraMounts []ExtraMount

The extraMounts field is used to add additional mounts to the kubelet container. Note that either bind or rbind are required in the options.

Examples:

extraMounts:
    - destination: /var/lib/example
      type: bind
      source: /var/lib/example
      options:
        - bind
        - rshared
        - rw

registerWithFQDN bool

The registerWithFQDN field is used to force kubelet to use the node FQDN for registration. This is required in clouds like AWS.

Valid values:

  • true

  • yes

  • false

  • no


nodeIP KubeletNodeIPConfig

The nodeIP field is used to configure the --node-ip flag for the kubelet. This is used when a node has multiple addresses to choose from.

Examples:

nodeIP:
    # The `validSubnets` field configures the networks to pick kubelet node IP from.
    validSubnets:
        - 10.0.0.0/8
        - '!10.0.0.3/32'
        - fdc7::/16

KubeletNodeIPConfig

KubeletNodeIPConfig represents the kubelet node IP configuration.

Appears in:

# The `validSubnets` field configures the networks to pick kubelet node IP from.
validSubnets:
    - 10.0.0.0/8
    - '!10.0.0.3/32'
    - fdc7::/16

validSubnets []string

The validSubnets field configures the networks to pick the kubelet node IP from. For a dual-stack configuration, there should be two subnets: one for IPv4 and another for IPv6. IPs can be excluded from the list by using a negative match with !, e.g. !10.0.0.0/8. Negative subnet matches should be specified last to filter out IPs picked by positive matches. If not specified, the node IP is picked based on the cluster podCIDRs: an IPv4/IPv6 address or both.


NetworkConfig

NetworkConfig represents the machine’s networking config values.

Appears in:

hostname: worker-1 # Used to statically set the hostname for the machine.
# `interfaces` is used to define the network interface configuration.
interfaces:
    - interface: eth0 # The interface name.
      # Assigns static IP addresses to the interface.
      addresses:
        - 192.168.2.0/24
      # A list of routes associated with the interface.
      routes:
        - network: 0.0.0.0/0 # The route's network.
          gateway: 192.168.2.1 # The route's gateway.
          metric: 1024 # The optional metric for the route.
      mtu: 1500 # The interface's MTU.

      # # Bond specific options.
      # bond:
      #     # The interfaces that make up the bond.
      #     interfaces:
      #         - eth0
      #         - eth1
      #     mode: 802.3ad # A bond option.
      #     lacpRate: fast # A bond option.

      # # Indicates if DHCP should be used to configure the interface.
      # dhcp: true

      # # DHCP specific options.
      # dhcpOptions:
      #     routeMetric: 1024 # The priority of all routes received via DHCP.

      # # Wireguard specific configuration.

      # # wireguard server example
      # wireguard:
      #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
      #     listenPort: 51111 # Specifies a device's listening port.
      #     # Specifies a list of peer configurations to apply to a device.
      #     peers:
      #         - publicKey: ABCDEF... # Specifies the public key of this peer.
      #           endpoint: 192.168.1.3 # Specifies the endpoint of this peer entry.
      #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
      #           allowedIPs:
      #             - 192.168.1.0/24
      # # wireguard peer example
      # wireguard:
      #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
      #     # Specifies a list of peer configurations to apply to a device.
      #     peers:
      #         - publicKey: ABCDEF... # Specifies the public key of this peer.
      #           endpoint: 192.168.1.2 # Specifies the endpoint of this peer entry.
      #           persistentKeepaliveInterval: 10s # Specifies the persistent keepalive interval for this peer.
      #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
      #           allowedIPs:
      #             - 192.168.1.0/24

      # # Virtual (shared) IP address configuration.
      # vip:
      #     ip: 172.16.199.55 # Specifies the IP address to be used.
# Used to statically set the nameservers for the machine.
nameservers:
    - 9.8.7.6
    - 8.7.6.5

# # Allows for extra entries to be added to the `/etc/hosts` file
# extraHostEntries:
#     - ip: 192.168.1.100 # The IP of the host.
#       # The host alias.
#       aliases:
#         - example
#         - example.domain.tld

# # Configures KubeSpan feature.
# kubespan:
#     enabled: true # Enable the KubeSpan feature.

hostname string

Used to statically set the hostname for the machine.


interfaces []Device

interfaces is used to define the network interface configuration. By default all network interfaces will attempt a DHCP discovery. This can be further tuned through this configuration parameter.

Examples:

interfaces:
    - interface: eth0 # The interface name.
      # Assigns static IP addresses to the interface.
      addresses:
        - 192.168.2.0/24
      # A list of routes associated with the interface.
      routes:
        - network: 0.0.0.0/0 # The route's network.
          gateway: 192.168.2.1 # The route's gateway.
          metric: 1024 # The optional metric for the route.
      mtu: 1500 # The interface's MTU.

      # # Bond specific options.
      # bond:
      #     # The interfaces that make up the bond.
      #     interfaces:
      #         - eth0
      #         - eth1
      #     mode: 802.3ad # A bond option.
      #     lacpRate: fast # A bond option.

      # # Indicates if DHCP should be used to configure the interface.
      # dhcp: true

      # # DHCP specific options.
      # dhcpOptions:
      #     routeMetric: 1024 # The priority of all routes received via DHCP.

      # # Wireguard specific configuration.

      # # wireguard server example
      # wireguard:
      #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
      #     listenPort: 51111 # Specifies a device's listening port.
      #     # Specifies a list of peer configurations to apply to a device.
      #     peers:
      #         - publicKey: ABCDEF... # Specifies the public key of this peer.
      #           endpoint: 192.168.1.3 # Specifies the endpoint of this peer entry.
      #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
      #           allowedIPs:
      #             - 192.168.1.0/24
      # # wireguard peer example
      # wireguard:
      #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
      #     # Specifies a list of peer configurations to apply to a device.
      #     peers:
      #         - publicKey: ABCDEF... # Specifies the public key of this peer.
      #           endpoint: 192.168.1.2 # Specifies the endpoint of this peer entry.
      #           persistentKeepaliveInterval: 10s # Specifies the persistent keepalive interval for this peer.
      #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
      #           allowedIPs:
      #             - 192.168.1.0/24

      # # Virtual (shared) IP address configuration.
      # vip:
      #     ip: 172.16.199.55 # Specifies the IP address to be used.

nameservers []string

Used to statically set the nameservers for the machine. Defaults to 1.1.1.1 and 8.8.8.8

Examples:

nameservers:
    - 8.8.8.8
    - 1.1.1.1

extraHostEntries []ExtraHost

Allows for extra entries to be added to the /etc/hosts file

Examples:

extraHostEntries:
    - ip: 192.168.1.100 # The IP of the host.
      # The host alias.
      aliases:
        - example
        - example.domain.tld

Configures KubeSpan feature.

Examples:

kubespan:
    enabled: true # Enable the KubeSpan feature.

InstallConfig

InstallConfig represents the installation options for preparing a node.

Appears in:

disk: /dev/sda # The disk used for installations.
# Allows for supplying extra kernel args via the bootloader.
extraKernelArgs:
    - console=ttyS1
    - panic=10
image: ghcr.io/talos-systems/installer:latest # Allows for supplying the image used to perform the installation.
bootloader: true # Indicates if a bootloader should be installed.
wipe: false # Indicates if the installation disk should be wiped at installation time.

# # Look up disk using disk attributes like model, size, serial and others.
# diskSelector:
#     size: 4GB # Disk size.
#     model: WDC* # Disk model `/sys/block/<dev>/device/model`.

disk string

The disk used for installations.

Examples:

disk: /dev/sda
disk: /dev/nvme0

diskSelector InstallDiskSelector

Look up disk using disk attributes like model, size, serial and others. Always has priority over disk.

Examples:

diskSelector:
    size: 4GB # Disk size.
    model: WDC* # Disk model `/sys/block/<dev>/device/model`.

extraKernelArgs []string

Allows for supplying extra kernel args via the bootloader.

Examples:

extraKernelArgs:
    - talos.platform=metal
    - reboot=k

image string

Allows for supplying the image used to perform the installation. Image references for each Talos release can be found on the GitHub releases page.

Examples:

image: ghcr.io/talos-systems/installer:latest

bootloader bool

Indicates if a bootloader should be installed.

Valid values:

  • true

  • yes

  • false

  • no


wipe bool

Indicates if the installation disk should be wiped at installation time. Defaults to true.

Valid values:

  • true

  • yes

  • false

  • no


legacyBIOSSupport bool

Indicates if the MBR partition should be marked as bootable (active). Should be enabled only for systems with a legacy BIOS that doesn’t support the GPT partitioning scheme.
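
A minimal illustrative setting for such a system:

legacyBIOSSupport: true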


InstallDiskSelector

InstallDiskSelector represents the disk query parameters for the install disk lookup.

Appears in:

size: 4GB # Disk size.
model: WDC* # Disk model `/sys/block/<dev>/device/model`.

size InstallDiskSizeMatcher

Disk size.

Examples:

size: 4GB
size: '> 1TB'
size: <= 2TB

name string

Disk name /sys/block/<dev>/device/name.


model string

Disk model /sys/block/<dev>/device/model.


serial string

Disk serial number /sys/block/<dev>/serial.


modalias string

Disk modalias /sys/block/<dev>/device/modalias.


uuid string

Disk UUID /sys/block/<dev>/uuid.


wwid string

Disk WWID /sys/block/<dev>/wwid.


type InstallDiskType

Disk Type.

Valid values:

  • ssd

  • hdd

  • nvme

  • sd


TimeConfig

TimeConfig represents the options for configuring time on a machine.

Appears in:

disabled: false # Indicates if the time service is disabled for the machine.
# Specifies time (NTP) servers to use for setting the system time.
servers:
    - time.cloudflare.com
bootTimeout: 2m0s # Specifies the timeout when the node time is considered to be in sync unlocking the boot sequence.

disabled bool

Indicates if the time service is disabled for the machine. Defaults to false.


servers []string

Specifies time (NTP) servers to use for setting the system time. Defaults to pool.ntp.org


bootTimeout Duration

Specifies the timeout when the node time is considered to be in sync, unlocking the boot sequence. NTP sync will still be running in the background. Defaults to “infinity” (waiting forever for time sync).


RegistriesConfig

RegistriesConfig represents the image pull options.

Appears in:

# Specifies mirror configuration for each registry.
mirrors:
    docker.io:
        # List of endpoints (URLs) for registry mirrors to use.
        endpoints:
            - https://registry.local
# Specifies TLS & auth configuration for HTTPS image registries.
config:
    registry.local:
        # The TLS configuration for the registry.
        tls:
            # Enable mutual TLS authentication with the registry.
            clientIdentity:
                crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
                key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u
        # The auth configuration for this registry.
        auth:
            username: username # Optional registry authentication.
            password: password # Optional registry authentication.

mirrors map[string]RegistryMirrorConfig

Specifies mirror configuration for each registry. This setting allows using local pull-through caching registries, air-gapped installations, etc.

The registry name is the first segment of the image identifier, with ‘docker.io’ being the default one. To catch any registry names not specified explicitly, use ‘*’.

Examples:

mirrors:
    ghcr.io:
        # List of endpoints (URLs) for registry mirrors to use.
        endpoints:
            - https://registry.insecure
            - https://ghcr.io/v2/

config map[string]RegistryConfig

Specifies TLS & auth configuration for HTTPS image registries. Mutual TLS can be enabled with the ‘clientIdentity’ option.

TLS configuration can be skipped if the registry has a trusted server certificate.

Examples:

config:
    registry.insecure:
        # The TLS configuration for the registry.
        tls:
            insecureSkipVerify: true # Skip TLS server certificate verification (not recommended).

            # # Enable mutual TLS authentication with the registry.
            # clientIdentity:
            #     crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
            #     key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

        # # The auth configuration for this registry.
        # auth:
        #     username: username # Optional registry authentication.
        #     password: password # Optional registry authentication.

PodCheckpointer

PodCheckpointer represents the pod-checkpointer config values.


image string

The image field is an override to the default pod-checkpointer image.


CoreDNS

CoreDNS represents the CoreDNS config values.

Appears in:

image: docker.io/coredns/coredns:1.8.6 # The `image` field is an override to the default coredns image.

disabled bool

Disable coredns deployment on cluster bootstrap.


image string

The image field is an override to the default coredns image.


Endpoint

Endpoint represents the endpoint URL parsed out of the machine config.

Appears in:

https://1.2.3.4:6443
https://cluster1.internal:6443
udp://127.0.0.1:12345
tcp://1.2.3.4:12345

ControlPlaneConfig

ControlPlaneConfig represents the control plane configuration options.

Appears in:

endpoint: https://1.2.3.4 # Endpoint is the canonical controlplane endpoint, which can be an IP address or a DNS hostname.
localAPIServerPort: 443 # The port that the API server listens on internally.

endpoint Endpoint

Endpoint is the canonical controlplane endpoint, which can be an IP address or a DNS hostname. It is single-valued, and may optionally include a port number.

Examples:

endpoint: https://1.2.3.4:6443
endpoint: https://cluster1.internal:6443

localAPIServerPort int

The port that the API server listens on internally. This may be different than the port portion listed in the endpoint field above. The default is 6443.


APIServerConfig

APIServerConfig represents the kube apiserver configuration options.

Appears in:

image: k8s.gcr.io/kube-apiserver:v1.23.1 # The container image used in the API server manifest.
# Extra arguments to supply to the API server.
extraArgs:
    feature-gates: ServerSideApply=true
    http2-max-streams-per-connection: "32"
# Extra certificate subject alternative names for the API server's certificate.
certSANs:
    - 1.2.3.4
    - 4.5.6.7

image string

The container image used in the API server manifest.

Examples:

image: k8s.gcr.io/kube-apiserver:v1.23.1

extraArgs map[string]string

Extra arguments to supply to the API server.


extraVolumes []VolumeMountConfig

Extra volumes to mount to the API server static pod.


certSANs []string

Extra certificate subject alternative names for the API server’s certificate.


disablePodSecurityPolicy bool

Disable PodSecurityPolicy in the API server and default manifests.


ControllerManagerConfig

ControllerManagerConfig represents the kube controller manager configuration options.

Appears in:

image: k8s.gcr.io/kube-controller-manager:v1.23.1 # The container image used in the controller manager manifest.
# Extra arguments to supply to the controller manager.
extraArgs:
    feature-gates: ServerSideApply=true

image string

The container image used in the controller manager manifest.

Examples:

image: k8s.gcr.io/kube-controller-manager:v1.23.1

extraArgs map[string]string

Extra arguments to supply to the controller manager.


extraVolumes []VolumeMountConfig

Extra volumes to mount to the controller manager static pod.


ProxyConfig

ProxyConfig represents the kube proxy configuration options.

Appears in:

image: k8s.gcr.io/kube-proxy:v1.23.1 # The container image used in the kube-proxy manifest.
mode: ipvs # proxy mode of kube-proxy.
# Extra arguments to supply to kube-proxy.
extraArgs:
    proxy-mode: iptables

disabled bool

Disable kube-proxy deployment on cluster bootstrap.

Examples:

disabled: false

image string

The container image used in the kube-proxy manifest.

Examples:

image: k8s.gcr.io/kube-proxy:v1.23.1

mode string

proxy mode of kube-proxy. The default is ‘iptables’.


extraArgs map[string]string

Extra arguments to supply to kube-proxy.


SchedulerConfig

SchedulerConfig represents the kube scheduler configuration options.

Appears in:

image: k8s.gcr.io/kube-scheduler:v1.23.1 # The container image used in the scheduler manifest.
# Extra arguments to supply to the scheduler.
extraArgs:
    feature-gates: AllBeta=true

image string

The container image used in the scheduler manifest.

Examples:

image: k8s.gcr.io/kube-scheduler:v1.23.1

extraArgs map[string]string

Extra arguments to supply to the scheduler.


extraVolumes []VolumeMountConfig

Extra volumes to mount to the scheduler static pod.


EtcdConfig

EtcdConfig represents the etcd configuration options.

Appears in:

image: gcr.io/etcd-development/etcd:v3.5.1 # The container image used to create the etcd service.
# The `ca` is the root certificate authority of the PKI.
ca:
    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u
# Extra arguments to supply to etcd.
extraArgs:
    election-timeout: "5000"

# # The subnet from which the advertise URL should be.
# subnet: 10.0.0.0/8

image string

The container image used to create the etcd service.

Examples:

image: gcr.io/etcd-development/etcd:v3.5.1

ca PEMEncodedCertificateAndKey

The ca is the root certificate authority of the PKI. It is composed of a base64 encoded crt and key.

Examples:

ca:
    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

extraArgs map[string]string

Extra arguments to supply to etcd. Note that the following args are not allowed:

  • name
  • data-dir
  • initial-cluster-state
  • listen-peer-urls
  • listen-client-urls
  • cert-file
  • key-file
  • trusted-ca-file
  • peer-client-cert-auth
  • peer-cert-file
  • peer-trusted-ca-file
  • peer-key-file

subnet string

The subnet from which the advertise URL should be selected.

Examples:

subnet: 10.0.0.0/8

ClusterNetworkConfig

ClusterNetworkConfig represents kube networking configuration options.

Appears in:

# The CNI used.
cni:
    name: flannel # Name of CNI to use.
dnsDomain: cluster.local # The domain used by Kubernetes DNS.
# The pod subnet CIDR.
podSubnets:
    - 10.244.0.0/16
# The service subnet CIDR.
serviceSubnets:
    - 10.96.0.0/12

cni CNIConfig

The CNI used. Composed of “name” and “urls”. The “name” key supports the following options: “flannel”, “custom”, and “none”. “flannel” uses Talos-managed Flannel CNI, and that’s the default option. “custom” uses custom manifests that should be provided in “urls”. “none” indicates that Talos will not manage any CNI installation.

Examples:

cni:
    name: custom # Name of CNI to use.
    # URLs containing manifests to apply for the CNI.
    urls:
        - https://docs.projectcalico.org/archive/v3.20/manifests/canal.yaml

dnsDomain string

The domain used by Kubernetes DNS. The default is cluster.local

Examples:

dnsDomain: cluster.local

podSubnets []string

The pod subnet CIDR.

Examples:

podSubnets:
    - 10.244.0.0/16

serviceSubnets []string

The service subnet CIDR.

Examples:

serviceSubnets:
    - 10.96.0.0/12

CNIConfig

CNIConfig represents the CNI configuration options.

Appears in:

name: custom # Name of CNI to use.
# URLs containing manifests to apply for the CNI.
urls:
    - https://docs.projectcalico.org/archive/v3.20/manifests/canal.yaml

name string

Name of CNI to use.

Valid values:

  • flannel

  • custom

  • none


urls []string

URLs containing manifests to apply for the CNI. Should be present for “custom”, must be empty for “flannel” and “none”.


ExternalCloudProviderConfig

ExternalCloudProviderConfig contains external cloud provider configuration.

Appears in:

enabled: true # Enable external cloud provider.
# A list of urls that point to additional manifests for an external cloud provider.
manifests:
    - https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/rbac.yaml
    - https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/aws-cloud-controller-manager-daemonset.yaml

enabled bool

Enable external cloud provider.

Valid values:

  • true

  • yes

  • false

  • no


manifests []string

A list of urls that point to additional manifests for an external cloud provider. These will get automatically deployed as part of the bootstrap.

Examples:

manifests:
    - https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/rbac.yaml
    - https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/aws-cloud-controller-manager-daemonset.yaml

AdminKubeconfigConfig

AdminKubeconfigConfig contains admin kubeconfig settings.

Appears in:

certLifetime: 1h0m0s # Admin kubeconfig certificate lifetime (default is 1 year).

certLifetime Duration

Admin kubeconfig certificate lifetime (default is 1 year). Field format accepts any Go time.Duration format (‘1h’ for one hour, ‘10m’ for ten minutes).


MachineDisk

MachineDisk represents the options available for partitioning, formatting, and mounting extra disks.

Appears in:

- device: /dev/sdb # The name of the disk to use.
  # A list of partitions to create on the disk.
  partitions:
    - mountpoint: /var/mnt/extra # Where to mount the partition.

      # # The size of partition: either bytes or human readable representation. If `size:` is omitted, the partition is sized to occupy the full disk.

      # # Human readable representation.
      # size: 100 MB
      # # Precise value in bytes.
      # size: 1073741824

device string

The name of the disk to use.


partitions []DiskPartition

A list of partitions to create on the disk.


DiskPartition

DiskPartition represents the options for a disk partition.

Appears in:


size DiskSize

The size of partition: either bytes or human readable representation. If size: is omitted, the partition is sized to occupy the full disk.

Examples:

size: 100 MB
size: 1073741824

mountpoint string

Where to mount the partition.


EncryptionConfig

EncryptionConfig represents partition encryption settings.

Appears in:


provider string

Encryption provider to use for the encryption.

Examples:

provider: luks2

Defines the encryption keys generation and storage method.


cipher string

Cipher kind to use for the encryption. Depends on the encryption provider.

Valid values:

  • aes-xts-plain64

  • xchacha12,aes-adiantum-plain64

  • xchacha20,aes-adiantum-plain64

Examples:

cipher: aes-xts-plain64

keySize uint

Defines the encryption key length.


blockSize uint64

Defines the encryption sector size.

Examples:

blockSize: 4096

options []string

Additional --perf parameters for the LUKS2 encryption.

Valid values:

  • no_read_workqueue

  • no_write_workqueue

  • same_cpu_crypt

Examples:

options:
    - no_read_workqueue
    - no_write_workqueue

EncryptionKey

EncryptionKey represents configuration for disk encryption key.

Appears in:


static EncryptionKeyStatic

Key whose value is stored in the configuration file.


nodeID EncryptionKeyNodeID

Deterministically generated key from the node UUID and PartitionLabel.


slot int

Key slot number for LUKS2 encryption.


EncryptionKeyStatic

EncryptionKeyStatic represents throw away key type.

Appears in:


passphrase string

Defines the static passphrase value.


EncryptionKeyNodeID

EncryptionKeyNodeID represents deterministically generated key from the node UUID and PartitionLabel.

Appears in:

MachineFile

MachineFile represents a file to write to disk.

Appears in:

- content: '...' # The contents of the file.
  permissions: 0o666 # The file's permissions in octal.
  path: /tmp/file.txt # The path of the file.
  op: append # The operation to use

content string

The contents of the file.


permissions FileMode

The file’s permissions in octal.


path string

The path of the file.


op string

The operation to use

Valid values:

  • create

  • append

  • overwrite


ExtraHost

ExtraHost represents a host entry in /etc/hosts.

Appears in:

- ip: 192.168.1.100 # The IP of the host.
  # The host alias.
  aliases:
    - example
    - example.domain.tld

ip string

The IP of the host.


aliases []string

The host alias.


Device

Device represents a network interface.

Appears in:

- interface: eth0 # The interface name.
  # Assigns static IP addresses to the interface.
  addresses:
    - 192.168.2.0/24
  # A list of routes associated with the interface.
  routes:
    - network: 0.0.0.0/0 # The route's network.
      gateway: 192.168.2.1 # The route's gateway.
      metric: 1024 # The optional metric for the route.
  mtu: 1500 # The interface's MTU.

  # # Bond specific options.
  # bond:
  #     # The interfaces that make up the bond.
  #     interfaces:
  #         - eth0
  #         - eth1
  #     mode: 802.3ad # A bond option.
  #     lacpRate: fast # A bond option.

  # # Indicates if DHCP should be used to configure the interface.
  # dhcp: true

  # # DHCP specific options.
  # dhcpOptions:
  #     routeMetric: 1024 # The priority of all routes received via DHCP.

  # # Wireguard specific configuration.

  # # wireguard server example
  # wireguard:
  #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
  #     listenPort: 51111 # Specifies a device's listening port.
  #     # Specifies a list of peer configurations to apply to a device.
  #     peers:
  #         - publicKey: ABCDEF... # Specifies the public key of this peer.
  #           endpoint: 192.168.1.3 # Specifies the endpoint of this peer entry.
  #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
  #           allowedIPs:
  #             - 192.168.1.0/24
  # # wireguard peer example
  # wireguard:
  #     privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
  #     # Specifies a list of peer configurations to apply to a device.
  #     peers:
  #         - publicKey: ABCDEF... # Specifies the public key of this peer.
  #           endpoint: 192.168.1.2 # Specifies the endpoint of this peer entry.
  #           persistentKeepaliveInterval: 10s # Specifies the persistent keepalive interval for this peer.
  #           # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
  #           allowedIPs:
  #             - 192.168.1.0/24

  # # Virtual (shared) IP address configuration.
  # vip:
  #     ip: 172.16.199.55 # Specifies the IP address to be used.

interface string

The interface name.

Examples:

interface: eth0

addresses []string

Assigns static IP addresses to the interface. An address can be specified either in proper CIDR notation or as a standalone address (netmask of all ones is assumed).

Examples:

addresses:
    - 10.5.0.0/16
    - 192.168.3.7

routes []Route

A list of routes associated with the interface. If used in combination with DHCP, these routes will be appended to routes returned by DHCP server.

Examples:

routes:
    - network: 0.0.0.0/0 # The route's network.
      gateway: 10.5.0.1 # The route's gateway.
    - network: 10.2.0.0/16 # The route's network.
      gateway: 10.2.0.1 # The route's gateway.

bond Bond

Bond specific options.

Examples:

bond:
    # The interfaces that make up the bond.
    interfaces:
        - eth0
        - eth1
    mode: 802.3ad # A bond option.
    lacpRate: fast # A bond option.

vlans []Vlan

VLAN specific options.


mtu int

The interface’s MTU. If used in combination with DHCP, this will override any MTU settings returned from DHCP server.


dhcp bool

Indicates if DHCP should be used to configure the interface. The following DHCP options are supported:

  • OptionClasslessStaticRoute
  • OptionDomainNameServer
  • OptionDNSDomainSearchList
  • OptionHostName

Examples:

dhcp: true

ignore bool

Indicates if the interface should be ignored (skips configuration).


dummy bool

Indicates if the interface is a dummy interface. dummy is used to specify that this interface should be a virtual-only, dummy interface.


dhcpOptions DHCPOptions

DHCP specific options. dhcp must be set to true for these to take effect.

Examples:

dhcpOptions:
    routeMetric: 1024 # The priority of all routes received via DHCP.

wireguard DeviceWireguardConfig

Wireguard specific configuration. Includes things like private key, listen port, peers.

Examples:

wireguard:
    privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
    listenPort: 51111 # Specifies a device's listening port.
    # Specifies a list of peer configurations to apply to a device.
    peers:
        - publicKey: ABCDEF... # Specifies the public key of this peer.
          endpoint: 192.168.1.3 # Specifies the endpoint of this peer entry.
          # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
          allowedIPs:
            - 192.168.1.0/24
wireguard:
    privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
    # Specifies a list of peer configurations to apply to a device.
    peers:
        - publicKey: ABCDEF... # Specifies the public key of this peer.
          endpoint: 192.168.1.2 # Specifies the endpoint of this peer entry.
          persistentKeepaliveInterval: 10s # Specifies the persistent keepalive interval for this peer.
          # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
          allowedIPs:
            - 192.168.1.0/24

vip DeviceVIPConfig

Virtual (shared) IP address configuration.

Examples:

vip:
    ip: 172.16.199.55 # Specifies the IP address to be used.

DHCPOptions

DHCPOptions contains options for configuring the DHCP settings for a given interface.

Appears in:

routeMetric: 1024 # The priority of all routes received via DHCP.

routeMetric uint32

The priority of all routes received via DHCP.


ipv4 bool

Enables DHCPv4 protocol for the interface (default is enabled).


ipv6 bool

Enables DHCPv6 protocol for the interface (default is disabled).


DeviceWireguardConfig

DeviceWireguardConfig contains settings for configuring Wireguard network interface.

Appears in:

privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
listenPort: 51111 # Specifies a device's listening port.
# Specifies a list of peer configurations to apply to a device.
peers:
    - publicKey: ABCDEF... # Specifies the public key of this peer.
      endpoint: 192.168.1.3 # Specifies the endpoint of this peer entry.
      # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
      allowedIPs:
        - 192.168.1.0/24
privateKey: ABCDEF... # Specifies a private key configuration (base64 encoded).
# Specifies a list of peer configurations to apply to a device.
peers:
    - publicKey: ABCDEF... # Specifies the public key of this peer.
      endpoint: 192.168.1.2 # Specifies the endpoint of this peer entry.
      persistentKeepaliveInterval: 10s # Specifies the persistent keepalive interval for this peer.
      # AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.
      allowedIPs:
        - 192.168.1.0/24

privateKey string

Specifies a private key configuration (base64 encoded). Can be generated by wg genkey.


listenPort int

Specifies a device’s listening port.


firewallMark int

Specifies a device’s firewall mark.


peers []DeviceWireguardPeer

Specifies a list of peer configurations to apply to a device.


DeviceWireguardPeer

DeviceWireguardPeer a WireGuard device peer configuration.

Appears in:


publicKey string

Specifies the public key of this peer. Can be extracted from private key by running wg pubkey < private.key > public.key && cat public.key.


endpoint string

Specifies the endpoint of this peer entry.


persistentKeepaliveInterval Duration

Specifies the persistent keepalive interval for this peer. Field format accepts any Go time.Duration format (‘1h’ for one hour, ‘10m’ for ten minutes).


allowedIPs []string

AllowedIPs specifies a list of allowed IP addresses in CIDR notation for this peer.


DeviceVIPConfig

DeviceVIPConfig contains settings for configuring a Virtual Shared IP on an interface.

Appears in:

ip: 172.16.199.55 # Specifies the IP address to be used.

ip string

Specifies the IP address to be used.


equinixMetal VIPEquinixMetalConfig

Specifies the Equinix Metal API settings to assign VIP to the node.


hcloud VIPHCloudConfig

Specifies the Hetzner Cloud API settings to assign VIP to the node.


VIPEquinixMetalConfig

VIPEquinixMetalConfig contains settings for Equinix Metal VIP management.

Appears in:


apiToken string

Specifies the Equinix Metal API Token.


VIPHCloudConfig

VIPHCloudConfig contains settings for Hetzner Cloud VIP management.

Appears in:


apiToken string

Specifies the Hetzner Cloud API Token.


Bond

Bond contains the various options for configuring a bonded interface.

Appears in:

# The interfaces that make up the bond.
interfaces:
    - eth0
    - eth1
mode: 802.3ad # A bond option.
lacpRate: fast # A bond option.

interfaces []string

The interfaces that make up the bond.


arpIPTarget []string

A bond option. Please see the official kernel documentation. Not supported at the moment.


mode string

A bond option. Please see the official kernel documentation.


xmitHashPolicy string

A bond option. Please see the official kernel documentation.


lacpRate string

A bond option. Please see the official kernel documentation.


adActorSystem string

A bond option. Please see the official kernel documentation. Not supported at the moment.


arpValidate string

A bond option. Please see the official kernel documentation.


arpAllTargets string

A bond option. Please see the official kernel documentation.


primary string

A bond option. Please see the official kernel documentation.


primaryReselect string

A bond option. Please see the official kernel documentation.


failOverMac string

A bond option. Please see the official kernel documentation.


adSelect string

A bond option. Please see the official kernel documentation.


miimon uint32

A bond option. Please see the official kernel documentation.


updelay uint32

A bond option. Please see the official kernel documentation.


downdelay uint32

A bond option. Please see the official kernel documentation.


arpInterval uint32

A bond option. Please see the official kernel documentation.


resendIgmp uint32

A bond option. Please see the official kernel documentation.


minLinks uint32

A bond option. Please see the official kernel documentation.


lpInterval uint32

A bond option. Please see the official kernel documentation.


packetsPerSlave uint32

A bond option. Please see the official kernel documentation.


numPeerNotif uint8

A bond option. Please see the official kernel documentation.


tlbDynamicLb uint8

A bond option. Please see the official kernel documentation.


allSlavesActive uint8

A bond option. Please see the official kernel documentation.


useCarrier bool

A bond option. Please see the official kernel documentation.


adActorSysPrio uint16

A bond option. Please see the official kernel documentation.


adUserPortKey uint16

A bond option. Please see the official kernel documentation.


peerNotifyDelay uint32

A bond option. Please see the official kernel documentation.


Vlan

Vlan represents vlan settings for a device.

Appears in:


addresses []string

The addresses in CIDR notation or as plain IPs to use.


routes []Route

A list of routes associated with the VLAN.


dhcp bool

Indicates if DHCP should be used.


vlanId uint16

The VLAN’s ID.


mtu uint32

The VLAN’s MTU.


vip DeviceVIPConfig

The VLAN’s virtual IP address configuration.
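
Putting the fields above together, a sketch of a vlans entry on a device (all values are illustrative):

vlans:
    - vlanId: 100 # The VLAN's ID.
      # The addresses in CIDR notation or as plain IPs to use.
      addresses:
        - 192.168.100.10/24
      mtu: 1500 # The VLAN's MTU.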


Route

Route represents a network route.

Appears in:

- network: 0.0.0.0/0 # The route's network.
  gateway: 10.5.0.1 # The route's gateway.
- network: 10.2.0.0/16 # The route's network.
  gateway: 10.2.0.1 # The route's gateway.

network string

The route’s network.


gateway string

The route’s gateway.


source string

The route’s source address (optional).


metric uint32

The optional metric for the route.


RegistryMirrorConfig

RegistryMirrorConfig represents mirror configuration for a registry.

Appears in:

ghcr.io:
    # List of endpoints (URLs) for registry mirrors to use.
    endpoints:
        - https://registry.insecure
        - https://ghcr.io/v2/

endpoints []string

List of endpoints (URLs) for registry mirrors to use. Endpoint configures HTTP/HTTPS access mode, host name, port and path (if path is not set, it defaults to /v2).


RegistryConfig

RegistryConfig specifies auth & TLS config per registry.

Appears in:

registry.insecure:
    # The TLS configuration for the registry.
    tls:
        insecureSkipVerify: true # Skip TLS server certificate verification (not recommended).

        # # Enable mutual TLS authentication with the registry.
        # clientIdentity:
        #     crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
        #     key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

    # # The auth configuration for this registry.
    # auth:
    #     username: username # Optional registry authentication.
    #     password: password # Optional registry authentication.

tls RegistryTLSConfig

The TLS configuration for the registry.

Examples:

tls:
    # Enable mutual TLS authentication with the registry.
    clientIdentity:
        crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
        key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u
tls:
    insecureSkipVerify: true # Skip TLS server certificate verification (not recommended).

    # # Enable mutual TLS authentication with the registry.
    # clientIdentity:
    #     crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    #     key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

auth RegistryAuthConfig

The auth configuration for this registry.

Examples:

auth:
    username: username # Optional registry authentication.
    password: password # Optional registry authentication.

RegistryAuthConfig

RegistryAuthConfig specifies authentication configuration for a registry.

Appears in:

username: username # Optional registry authentication.
password: password # Optional registry authentication.

username string

Optional registry authentication. The meaning of each field is the same with the corresponding field in .docker/config.json.


password string

Optional registry authentication. The meaning of each field is the same with the corresponding field in .docker/config.json.


auth string

Optional registry authentication. The meaning of each field is the same with the corresponding field in .docker/config.json.


identityToken string

Optional registry authentication. The meaning of each field is the same with the corresponding field in .docker/config.json.


RegistryTLSConfig

RegistryTLSConfig specifies TLS config for HTTPS registries.

Appears in:

# Enable mutual TLS authentication with the registry.
clientIdentity:
    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u
insecureSkipVerify: true # Skip TLS server certificate verification (not recommended).

# # Enable mutual TLS authentication with the registry.
# clientIdentity:
#     crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
#     key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

clientIdentity PEMEncodedCertificateAndKey

Enable mutual TLS authentication with the registry. Client certificate and key should be base64-encoded.

Examples:

clientIdentity:
    crt: TFMwdExTMUNSVWRKVGlCRFJWSlVTVVpKUTBGVVJTMHRMUzB0Q2sxSlNVSklla05DTUhGLi4u
    key: TFMwdExTMUNSVWRKVGlCRlJESTFOVEU1SUZCU1NWWkJWRVVnUzBWWkxTMHRMUzBLVFVNLi4u

ca Base64Bytes

CA registry certificate to add to the list of trusted certificates. The certificate should be base64-encoded.


insecureSkipVerify bool

Skip TLS server certificate verification (not recommended).


SystemDiskEncryptionConfig

SystemDiskEncryptionConfig specifies system disk partitions encryption settings.

Appears in:

# Ephemeral partition encryption.
ephemeral:
    provider: luks2 # Encryption provider to use for the encryption.
    # Defines the encryption keys generation and storage method.
    keys:
        - # Deterministically generated key from the node UUID and PartitionLabel.
          nodeID: {}
          slot: 0 # Key slot number for LUKS2 encryption.

    # # Cipher kind to use for the encryption. Depends on the encryption provider.
    # cipher: aes-xts-plain64

    # # Defines the encryption sector size.
    # blockSize: 4096

    # # Additional --perf parameters for the LUKS2 encryption.
    # options:
    #     - no_read_workqueue
    #     - no_write_workqueue

state EncryptionConfig

State partition encryption.


ephemeral EncryptionConfig

Ephemeral partition encryption.


FeaturesConfig

FeaturesConfig describe individual Talos features that can be switched on or off.

Appears in:

rbac: true # Enable role-based access control (RBAC).

rbac bool

Enable role-based access control (RBAC).


VolumeMountConfig

VolumeMountConfig struct describes extra volume mount for the static pods.

Appears in:


hostPath string

Path on the host.

Examples:

hostPath: /var/lib/auth

mountPath string

Path in the container.

Examples:

mountPath: /etc/kubernetes/auth

readonly bool

Mount the volume read only.

Examples:

readonly: true
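
Taken together, a sketch of a single extra volume mount entry as it might appear in a control plane component's list of extra volumes (the paths are illustrative):

- hostPath: /var/lib/auth # Path on the host.
  mountPath: /etc/kubernetes/auth # Path in the container.
  readonly: true # Mount the volume read only.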

ClusterInlineManifest

ClusterInlineManifest struct describes inline bootstrap manifests for the user.


name string

Name of the manifest. Name should be unique.

Examples:

name: csi

contents string

Manifest contents as a string.

Examples:

contents: /etc/kubernetes/auth
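
As an illustration, a sketch of a single inline manifest entry, assuming the cluster-level inlineManifests list and an arbitrary Namespace manifest:

inlineManifests:
    - name: namespace-ci # Name of the manifest.
      contents: |-
        apiVersion: v1
        kind: Namespace
        metadata:
            name: ci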

NetworkKubeSpan

NetworkKubeSpan struct describes KubeSpan configuration.

Appears in:

enabled: true # Enable the KubeSpan feature.

enabled bool

Enable the KubeSpan feature. Cluster discovery should be enabled with .cluster.discovery.enabled for KubeSpan to be enabled.


allowDownPeerBypass bool

Skip sending traffic via KubeSpan if the peer connection state is not up. This provides configurable choice between connectivity and security: either traffic is always forced to go via KubeSpan (even if Wireguard peer connection is not up), or traffic can go directly to the peer if Wireguard connection can’t be established.
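
Since KubeSpan requires cluster discovery, a minimal sketch enables both features together (using the machine and cluster configuration paths documented in this reference):

machine:
    network:
        kubespan:
            enabled: true # Enable the KubeSpan feature.
cluster:
    discovery:
        enabled: true # Enable the cluster membership discovery feature.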


ClusterDiscoveryConfig

ClusterDiscoveryConfig struct configures cluster membership discovery.

Appears in:

enabled: true # Enable the cluster membership discovery feature.
# Configure registries used for cluster member discovery.
registries:
    # Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information
    kubernetes: {}
    # Service registry is using an external service to push and pull information about cluster members.
    service:
        endpoint: https://discovery.talos.dev/ # External service endpoint.

enabled bool

Enable the cluster membership discovery feature. Cluster discovery is based on individual registries which are configured under the registries field.


registries DiscoveryRegistriesConfig

Configure registries used for cluster member discovery.


DiscoveryRegistriesConfig

DiscoveryRegistriesConfig struct configures cluster membership discovery.

Appears in:


kubernetes RegistryKubernetesConfig

Kubernetes registry uses Kubernetes API server to discover cluster members and stores additional information as annotations on the Node resources.


service RegistryServiceConfig

Service registry is using an external service to push and pull information about cluster members.


RegistryKubernetesConfig

RegistryKubernetesConfig struct configures Kubernetes discovery registry.

Appears in:


disabled bool

Disable Kubernetes discovery registry.


RegistryServiceConfig

RegistryServiceConfig struct configures Kubernetes discovery registry.

Appears in:


disabled bool

Disable external service discovery registry.


endpoint string

External service endpoint.

Examples:

endpoint: https://discovery.talos.dev/

UdevConfig

UdevConfig describes how the udev system should be configured.

Appears in:

# List of udev rules to apply to the udev system
rules:
    - SUBSYSTEM=="drm", KERNEL=="renderD*", GROUP="44", MODE="0660"

rules []string

List of udev rules to apply to the udev system


LoggingConfig

LoggingConfig struct configures Talos logging.

Appears in:

# Logging destination.
destinations:
    - endpoint: tcp://1.2.3.4:12345 # Where to send logs. Supported protocols are "tcp" and "udp".
      format: json_lines # Logs format.

destinations []LoggingDestination

Logging destination.


LoggingDestination

LoggingDestination struct configures Talos logging destination.

Appears in:


endpoint Endpoint

Where to send logs. Supported protocols are “tcp” and “udp”.

Examples:

endpoint: udp://127.0.0.1:12345
endpoint: tcp://1.2.3.4:12345

format string

Logs format.

Valid values:

  • json_lines

8.4 - Kernel

Commandline Parameters

Talos supports a number of kernel commandline parameters. Some are required for it to operate. Others are optional and useful in certain circumstances.

Several of these are enforced by the Kernel Self Protection Project (KSPP).

Required parameters:

  • talos.config: the HTTP(S) URL at which the machine configuration data can be found
  • talos.platform: can be one of aws, azure, container, digitalocean, gcp, metal, packet, or vmware
  • init_on_alloc=1: required by KSPP
  • slab_nomerge: required by KSPP
  • pti=on: required by KSPP

Recommended parameters:

  • init_on_free=1: advised by KSPP if minimizing stale data lifetime is important
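
Taken together, a bare-metal boot might include a kernel command line such as the following sketch (the machine configuration URL is a placeholder):

talos.platform=metal talos.config=https://example.com/machine-config.yaml init_on_alloc=1 slab_nomerge pti=on init_on_free=1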

Available Talos-specific parameters

ip

Initial configuration of the interface, routes, DNS, NTP servers.

Full documentation is available in the Linux kernel docs.

ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:<dns0-ip>:<dns1-ip>:<ntp0-ip>

Talos will use the configuration supplied via the kernel parameter as the initial network configuration. This parameter is useful in environments where DHCP doesn’t provide IP addresses or when default DNS and NTP servers should be overridden before loading machine configuration. Partial configuration can be applied as well, e.g. ip=:::::::<dns0-ip>:<dns1-ip>:<ntp0-ip> sets only the DNS and NTP servers.
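
For example, a sketch of a fully static interface configuration (all values are illustrative):

ip=192.168.10.20::192.168.10.1:255.255.255.0:worker-1:eth0:off:1.1.1.1:8.8.8.8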

panic

The amount of time to wait after a panic before a reboot is issued.

Talos will always reboot if it encounters an unrecoverable error. However, when collecting debug information, it may reboot too quickly for humans to read the logs. This option allows the user to delay the reboot to give time to collect debug information from the console screen.

A value of 0 disables automatic rebooting entirely.
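
For example, panic=30 delays the reboot for 30 seconds after a panic.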

talos.config

The URL at which the machine configuration data may be found.

talos.platform

The platform name on which Talos will run.

Valid options are:

  • aws
  • azure
  • container
  • digitalocean
  • gcp
  • metal
  • packet
  • vmware

talos.board

The board name, if Talos is being used on an ARM64 SBC.

Supported boards are:

  • bananapi_m64: Banana Pi M64
  • libretech_all_h3_cc_h5: Libre Computer ALL-H3-CC
  • rock64: Pine64 Rock64
  • rpi_4: Raspberry Pi 4, Model B

talos.hostname

The hostname to be used. The hostname is generally specified in the machine config. However, in some cases, the DHCP server needs to know the hostname before the machine configuration has been acquired.

Unless specifically required, the machine configuration should be used instead.

talos.shutdown

The type of shutdown to use when Talos is told to shutdown.

Valid options are:

  • halt
  • poweroff

talos.network.interface.ignore

A network interface which should be ignored and not configured by Talos.

Before a configuration is applied (early on each boot), Talos attempts to configure each network interface by DHCP. If there are many network interfaces on the machine which have link but no DHCP server, this can add significant boot delays.

This option may be specified multiple times for multiple network interfaces.
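
For example, to skip configuration of two interfaces (interface names are illustrative):

talos.network.interface.ignore=eth2 talos.network.interface.ignore=eth3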

8.5 - Platform

Metal

Below is an image visualizing the process of bootstrapping nodes.

9 - Learn More

9.1 - Philosophy

Distributed

Talos is intended to be operated in a distributed manner. That is, it is built for a high-availability dataplane first. Its etcd cluster is built in an ad-hoc manner, with each appointed node joining on its own directive (with proper security validations enforced, of course). Like Kubernetes itself, workloads are intended to be distributed across any number of compute nodes.

There should be no single points of failure, and the level of required coordination is as low as each platform allows.

Immutable

Talos takes immutability very seriously. Talos itself, even when installed on a disk, always runs from a SquashFS image, meaning that even if a directory is mounted to be writable, the image itself is never modified. All images are signed and delivered as single, versioned files. We can always run integrity checks on our image to verify that it has not been modified.

While Talos does allow a few, highly-controlled write points to the filesystem, we strive to make them as non-unique and non-critical as possible. In fact, we call the writable partition the “ephemeral” partition precisely because we want to make sure none of us ever uses it for unique, non-replicated, non-recreatable data. Thus, if all else fails, we can always wipe the disk and get back up and running.

Minimal

We are always working to reduce Talos’ footprint and keep it small. Because nearly the entire OS is built from scratch in Go, we are already starting out in a good position. We have no shell. We have no SSH. We have none of the GNU utilities, not even a rollup tool such as busybox. Everything which is included in Talos is there because it is necessary, and nothing is included which isn’t.

As a result, the OS right now produces a SquashFS image size of less than 80 MB.

Ephemeral

Everything Talos writes to its disk is either replicated or reconstructable. Since the control plane is highly available, the loss of any node will cause neither service disruption nor loss of data. No writes are even allowed to the vast majority of the filesystem. We even call the writable partition “ephemeral” to keep this idea always in focus.

Secure

Talos has always been designed with security in mind. With its immutability, its minimalism, its signing, and its component architecture, we are able to simply bypass huge classes of vulnerabilities. Moreover, because of the way we have designed Talos, we are able to take advantage of a number of additional settings, such as the recommendations of the Kernel Self Protection Project (KSPP) and the complete disablement of dynamic modules.

There are no passwords in Talos. All networked communication is encrypted and key-authenticated. The Talos certificates are short-lived and automatically-rotating. Kubernetes is always constructed with its own separate PKI structure which is enforced.

Declarative

Everything which can be configured in Talos is done so through a single YAML manifest. There is no scripting and no procedural steps. Everything is defined by the one declarative YAML file. This configuration includes that of both Talos itself and the Kubernetes which it forms.

This is achievable because Talos is tightly focused to do one thing: run kubernetes, in the easiest, most secure, most reliable way it can.

9.2 - Architecture

Talos is designed to be atomic in deployment and modular in composition.

It is atomic in the sense that the entirety of Talos is distributed as a single, self-contained image, which is versioned, signed, and immutable.

It is modular in the sense that it is composed of many separate components which have clearly defined gRPC interfaces which facilitate internal flexibility and external operational guarantees.

There are a number of components which comprise Talos. All of the main Talos components communicate with each other by gRPC, through a socket on the local machine. This imposes a clear separation of concerns and ensures that changes over time which affect the interoperation of components are a part of the public git record. The benefit is that each component may be iterated and changed as its needs dictate, so long as the external API is controlled. This is a key component in reducing coupling and maintaining modularity.

File system partitions

Talos uses these partitions with the following labels:

  1. EFI - stores EFI boot data.
  2. BIOS - used for GRUB’s second stage boot.
  3. BOOT - used for the boot loader, stores initramfs and kernel data.
  4. META - stores metadata about the Talos node, such as node IDs.
  5. STATE - stores machine configuration, node identity data for cluster discovery, and KubeSpan info.
  6. EPHEMERAL - stores ephemeral state information, mounted at /var.

The File System

One of the more unique design decisions in Talos is the layout of the root file system. There are three “layers” to the Talos root file system. At its core, the rootfs is a read-only squashfs. The squashfs is then mounted as a loop device into memory. This provides Talos with an immutable base.

The next layer is a set of tmpfs file systems for runtime specific needs. Aside from the standard pseudo file systems such as /dev, /proc, /run, /sys and /tmp, a special /system is created for internal needs. One reason for this is that we need special files such as /etc/hosts, and /etc/resolv.conf to be writable (remember that the rootfs is read-only). For example, at boot Talos will write /system/etc/hosts and then bind mount it over /etc/hosts. This means that instead of making all of /etc writable, Talos only makes very specific files writable under /etc.

All files under /system are completely reproducible. For files and directories that need to persist across boots, Talos creates overlayfs file systems. The /etc/kubernetes directory is a good example of this. Directories like this are overlayfs backed by an XFS file system mounted at /var.

The /var directory is owned by Kubernetes with the exception of the above overlayfs file systems. This directory is writable and used by etcd (in the case of control plane nodes), the kubelet, and the CRI (containerd).

9.3 - Components

In this section, we discuss the various components that underpin Talos.

Components

  • apid: When interacting with Talos, the gRPC API endpoint you interact with directly is provided by apid. apid acts as the gateway for all component interactions and forwards the requests to machined.
  • containerd: An industry-standard container runtime with an emphasis on simplicity, robustness, and portability. To learn more, see the containerd website.
  • machined: Talos replacement for the traditional Linux init-process. Specially designed to run Kubernetes and does not allow starting arbitrary user services.
  • networkd: Handles all of the host level network configuration. The configuration is defined under the networking key.
  • kernel: The Linux kernel included with Talos is configured according to the recommendations outlined in the Kernel Self Protection Project.
  • trustd: To run and operate a Kubernetes cluster, a certain level of trust is required. Based on the concept of a ‘Root of Trust’, trustd is a simple daemon responsible for establishing trust within the system.
  • udevd: Implementation of eudev into machined. eudev is Gentoo’s fork of udev, systemd’s device file manager for the Linux kernel. It manages device nodes in /dev and handles all user space actions when adding or removing devices. To learn more, see the Gentoo Wiki.

apid

When interacting with Talos, the gRPC api endpoint you will interact with directly is apid. Apid acts as the gateway for all component interactions. Apid provides a mechanism to route requests to the appropriate destination when running on a control plane node.

We’ll use some examples below to illustrate what apid is doing.

When a user wants to interact with a Talos component via talosctl, there are two flags that control the interaction with apid. The -e | --endpoints flag specifies which Talos node ( via apid ) should handle the connection. Typically this is a public-facing server. The -n | --nodes flag specifies which Talos node(s) should respond to the request. If --nodes is omitted, the first endpoint will be used.

Note: Typically, there will be an endpoint already defined in the Talos config file. Optionally, nodes can be included here as well.

For example, if a user wants to interact with machined, a command like talosctl -e cluster.talos.dev memory may be used.

$ talosctl -e cluster.talos.dev memory
NODE                TOTAL   USED   FREE   SHARED   BUFFERS   CACHE   AVAILABLE
cluster.talos.dev   7938    1768   2390   145      53        3724    6571

In this case, talosctl is interacting with apid running on cluster.talos.dev and forwarding the request to the machined api.

If we wanted to extend our example to retrieve memory from another node in our cluster, we could use the command talosctl -e cluster.talos.dev -n node02 memory.

$ talosctl -e cluster.talos.dev -n node02 memory
NODE    TOTAL   USED   FREE   SHARED   BUFFERS   CACHE   AVAILABLE
node02  7938    1768   2390   145      53        3724    6571

The apid instance on cluster.talos.dev receives the request and forwards it to apid running on node02, which forwards the request to the machined api.

We can further extend our example to retrieve memory for all nodes in our cluster by appending additional -n node flags or using a comma separated list of nodes ( -n node01,node02,node03 ):

$ talosctl -e cluster.talos.dev -n node01 -n node02 -n node03 memory
NODE     TOTAL    USED    FREE     SHARED   BUFFERS   CACHE   AVAILABLE
node01   7938     871     4071     137      49        2945    7042
node02   257844   14408   190796   18138    49        52589   227492
node03   257844   1830    255186   125      49        777     254556

The apid instance on cluster.talos.dev receives the request and forwards it to node01, node02, and node03, which then forwards the request to their local machined api.

containerd

Containerd provides the container runtime to launch workloads on Talos and Kubernetes.

Talos services are namespaced under the system namespace in containerd, whereas the Kubernetes services are namespaced under the k8s.io namespace.

machined

A common theme throughout the design of Talos is minimalism. We believe strongly in the UNIX philosophy that each program should do one job well. The init included in Talos is one example of this, and we are calling it “machined”.

We wanted to create a focused init that had one job - run Kubernetes. To that extent, machined is relatively static in that it does not allow for arbitrary user-defined services. Only the services necessary to run Kubernetes and manage the node are available. This includes:

  • containerd
  • kubelet
  • networkd
  • trustd
  • udevd

networkd

Networkd handles all of the host level network configuration. The configuration is defined under the networking key.

By default, we attempt to issue a DHCP request for every interface on the server. This can be overridden by supplying one of the following kernel arguments:

  • talos.network.interface.ignore - specify a list of interfaces to skip discovery on
  • ip - ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>:<dns0-ip>:<dns1-ip>:<ntp0-ip> as documented in the Linux kernel documentation
    • ex, ip=10.0.0.99:::255.0.0.0:control-1:eth0:off:10.0.0.1

kernel

The Linux kernel included with Talos is configured according to the recommendations outlined in the Kernel Self Protection Project (KSPP).

trustd

Security is one of the highest priorities within Talos. To run a Kubernetes cluster, a certain level of trust is required to operate a cluster. For example, orchestrating the bootstrap of a highly available control plane requires sensitive PKI data distribution.

To that end, we created trustd. Based on a Root of Trust concept, trustd is a simple daemon responsible for establishing trust within the system. Once trust is established, various methods become available to the trustee. For example, it can accept a write request from another node to place a file on disk.

Additional methods and capabilities will be added to the trustd component to support new functionality in the rest of the Talos environment.

udevd

Udevd handles the kernel device notifications and sets up the necessary links in /dev.

9.4 - Upgrades

Talos

The upgrade process for Talos, like everything else, begins with an API call. This call tells a node the installer image to use to perform the upgrade. Each Talos version corresponds to an installer with the same version, such that the version of the installer is the version of Talos which will be installed.

Because Talos is image based, even at run-time, upgrading Talos is almost exactly the same set of operations as installing Talos, with the difference that the system has already been initialized with a configuration.

An upgrade makes use of an A-B image scheme in order to facilitate rollbacks. This scheme retains the previous Talos kernel and OS image following each upgrade. If an upgrade fails to boot, Talos will roll back to the previous version. Likewise, Talos may be manually rolled back via API (or talosctl rollback). This will simply update the boot reference and reboot.

An upgrade can preserve data or not. If Talos is told to NOT preserve data, it will wipe its ephemeral partition, remove itself from the etcd cluster (if it is a control node), and generally make itself as pristine as is possible. The default option here is likely to change over time, so if your setup has a preference one way or the other, it is better to specify it explicitly, though we always try to make the default the “safe” choice.
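
As a sketch, an upgrade followed by a manual rollback might look like the following (the node address is illustrative):

talosctl -n 10.0.0.5 upgrade --image ghcr.io/talos-systems/installer:v0.14.0 --preserve
talosctl -n 10.0.0.5 rollback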

Sequence

When a Talos node receives the upgrade command, the first thing it does is cordon itself in Kubernetes, to avoid receiving any new workload. It then starts to drain away its existing workload.

NOTE: If any of your workloads is sensitive to being shut down ungracefully, be sure to use the lifecycle.preStop Pod spec.

Once all of the workload Pods are drained, Talos will start shutting down its internal processes. If it is a control node, this will include etcd. If preserve is not enabled, Talos will even leave etcd membership. (Don’t worry about this; we make sure the etcd cluster is healthy and that it will remain healthy after our node departs, before we allow this to occur.)

Once all the processes are stopped and the services are shut down, all of the filesystems will be unmounted. This allows Talos to produce a very clean upgrade, as close as possible to a pristine system. We verify the disk and then perform the actual image upgrade.

Finally, we tell the bootloader to boot once with the new kernel and OS image. Then we reboot.

After the node comes back up and Talos verifies itself, it will make permanent the bootloader change, rejoin the cluster, and finally uncordon itself to receive new workloads.

FAQs

Q. What happens if an upgrade fails?

A. There are many potential ways an upgrade can fail, but we always try to do the safe thing.

The most common first failure is an invalid installer image reference. In this case, Talos will fail to download the upgraded image and will abort the upgrade.

Sometimes, Talos is unable to successfully kill off all of the disk access points, in which case it cannot safely unmount all filesystems to effect the upgrade. In this case, it will abort the upgrade and reboot.

It is possible (especially with test builds) that the upgraded Talos system will fail to start. In this case, the node will be rebooted, and the bootloader will automatically use the previous Talos kernel and image, thus effectively aborting the upgrade.

Lastly, it is possible that Talos itself will upgrade successfully, start up, and rejoin the cluster but your workload will fail to run on it, for whatever reason. This is when you would use the talosctl rollback command to revert back to the previous Talos version.

Q. Can upgrades be scheduled?

A. We provide the Talos Controller Manager to coordinate upgrades of a cluster. Additionally, because the upgrade sequence is API-driven, you can easily tie this in to your own business logic to schedule and coordinate your upgrades.

Q. Can the upgrade process be observed?

A. The Talos Controller Manager does this internally, watching the logs of the node being upgraded, using the streaming log API of Talos.

You can do the same thing using the talosctl logs --follow machined command.

Q. Are worker node upgrades handled differently from control plane node upgrades?

A. Short answer: no.

Long answer: Both node types follow the same set procedure. However, since control plane nodes run additional services, such as etcd, there are some extra steps and checks performed on them. From the user’s standpoint, however, the processes are identical.

There are also additional restrictions on upgrading control plane nodes. For instance, Talos will refuse to upgrade a control plane node if that upgrade will cause a loss of quorum for etcd. This can generally be worked around by setting preserve to true.

Q. Will an upgrade try to do the whole cluster at once? Can I break my cluster by upgrading everything?

A. No.

Nothing prevents the user from sending any number of near-simultaneous upgrades to each node of the cluster. While most people would not attempt to do this, it may be the desired behaviour in certain situations.

If, however, multiple control plane nodes are asked to upgrade at the same time, Talos will protect itself by making sure only one control plane node upgrades at any time, through its checking of etcd quorum. A lease is taken out by the winning control plane node, and no other control plane node is allowed to execute the upgrade until the lease is released and the etcd cluster is healthy and will be healthy when the next node performs its upgrade.

Q. Is there an operator or controller which will keep my nodes updated automatically?

A. Yes.

We provide the Talos Controller Manager to perform this maintenance in a simple, controllable fashion.

9.5 - FAQs

How is Talos different from other container optimized Linux distros?

Talos shares a lot of attributes with other distros, but there are some important differences. Talos integrates tightly with Kubernetes, and is not meant to be a general-purpose operating system. The most important difference is that Talos is fully controlled by an API via a gRPC interface, instead of an ordinary shell. We don’t ship SSH, and there is no console access. Removing components such as these has allowed us to dramatically reduce the footprint of Talos, and in turn, improve a number of other areas like security, predictability, reliability, and consistency across platforms. It’s a big change from how operating systems have been managed in the past, but we believe that API-driven OSes are the future.

Why no shell or SSH?

Since Talos is fully API-driven, all maintenance and debugging operations should be possible via the OS API. We would like for Talos users to start thinking about what a “machine” is in the context of a Kubernetes cluster. That is, that a Kubernetes cluster can be thought of as one massive machine, and the nodes are merely additional, undifferentiated resources. We don’t want humans to focus on the nodes, but rather on the machine that is the Kubernetes cluster. Should an issue arise at the node level, talosctl should provide the necessary tooling to assist in the identification, debugging, and remediation of the issue. However, the API is based on the Principle of Least Privilege, and exposes only a limited set of methods. We envision Talos being a great place for the application of control theory in order to provide a self-healing platform.

Why the name “Talos”?

Talos was an automaton created by the Greek God of the forge to protect the island of Crete. He would patrol the coast and enforce laws throughout the land. We felt it was a fitting name for a security focused operating system designed to run Kubernetes.

9.6 - talosctl

The talosctl tool packs a lot of power into a small package. It acts as a reference implementation for the Talos API, but it also handles a lot of conveniences for the use of Talos and its clusters.

Video Walkthrough

To see some live examples of talosctl usage, view the following video:

Client Configuration

Talosctl configuration is located in $XDG_CONFIG_HOME/talos/config.yaml if $XDG_CONFIG_HOME is defined. Otherwise it is in $HOME/.talos/config. The location can always be overridden by the TALOSCONFIG environment variable or the --talosconfig parameter.

Like kubectl, talosctl uses the concept of configuration contexts, so any number of Talos clusters can be managed with a single configuration file. Unlike kubectl, it also comes with some intelligent tooling to manage the merging of new contexts into the config. The default operation is a non-destructive merge, where if a context of the same name already exists in the file, the context to be added is renamed by appending an index number. You can easily overwrite instead, as well. See the talosctl config help for more information.
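
As a sketch, merging a newly generated configuration and switching between contexts might look like this (the context name is illustrative):

talosctl config merge ./talosconfig
talosctl config contexts
talosctl config context my-cluster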

Endpoints and Nodes


The endpoints are the communication endpoints to which the client directly talks. These can be load balancers, DNS hostnames, a list of IPs, etc. Further, if multiple endpoints are specified, the client will automatically load balance and fail over between them. In general, it is recommended that these point to the set of control plane nodes, either directly or through a reverse proxy or load balancer.

Each endpoint will automatically proxy requests destined to another node through it, so it is not necessary to change the endpoint configuration just because you wish to talk to a different node within the cluster.

Endpoints do, however, need to be members of the same Talos cluster as the target node, because these proxied connections rely on certificate-based authentication.

The node is the target node on which you wish to perform the API call. While you can configure the target node (or even set of target nodes) inside the ’talosctl’ configuration file, it is often useful to simply and explicitly declare the target node(s) using the -n or --nodes command-line parameter.

Keep in mind, when specifying nodes that their IPs and/or hostnames are as seen by the endpoint servers, not as from the client. This is because all connections are proxied first through the endpoints.

Kubeconfig

The configuration for accessing a Talos Kubernetes cluster is obtained with talosctl. By default, talosctl will safely merge the cluster into the default kubeconfig. Like talosctl itself, in the event of a naming conflict, the new context name will be index-appended before insertion. The --force option can be used to overwrite instead.

You can also specify an alternate path by supplying it as a positional parameter.
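
As a sketch (the node address is illustrative):

# Merge the cluster into the default kubeconfig
talosctl -n 10.0.0.2 kubeconfig

# Write to an alternate path instead, overwriting on naming conflict
talosctl -n 10.0.0.2 kubeconfig ./kubeconfig --force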

Thus, like Talos clusters themselves, talosctl makes it easy to manage any number of kubernetes clusters from the same workstation.

Commands

Please see the CLI reference for the entire list of commands which are available from talosctl.

9.7 - Control Plane

This guide provides details on how Talos runs and bootstraps the Kubernetes control plane.

High-level Overview

Talos cluster bootstrap flow:

  1. The etcd service is started on control plane nodes. Instances of etcd on control plane nodes build the etcd cluster.
  2. The kubelet service is started.
  3. Control plane components are started as static pods via the kubelet, and the kube-apiserver component connects to the local (running on the same node) etcd instance.
  4. The kubelet obtains a client certificate using the bootstrap token, via the control plane endpoint (kube-apiserver and kube-controller-manager).
  5. The kubelet registers the node in the API server.
  6. Kubernetes control plane schedules pods on the nodes.

Cluster Bootstrapping

All nodes start the kubelet service. The kubelet tries to contact the control plane endpoint, but as it is not up yet, it keeps retrying.

One of the control plane nodes is chosen as the bootstrap node. The node’s type can be either init or controlplane, where the controlplane type is promoted using the bootstrap API (talosctl bootstrap). The bootstrap node initiates the etcd bootstrap process by initializing etcd as the first member of the cluster.
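
For example, a controlplane node can be promoted with a command like the following sketch (addresses are illustrative):

talosctl --endpoints 10.0.0.2 --nodes 10.0.0.2 bootstrap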

Note: there should be only one bootstrap node for the cluster lifetime. Once etcd is bootstrapped, the bootstrap node has no special role and acts the same way as other control plane nodes.

The etcd services on non-bootstrap nodes try to get the Endpoints resource via the control plane endpoint, but that request fails, as the control plane endpoint is not up yet.

As soon as etcd is up on the bootstrap node, static pod definitions for the Kubernetes control plane components (kube-apiserver, kube-controller-manager, kube-scheduler) are rendered to disk. The kubelet service on the bootstrap node picks up the static pod definitions and starts the Kubernetes control plane components. As soon as kube-apiserver is launched, the control plane endpoint comes up.

The bootstrap node acquires an etcd mutex and injects the bootstrap manifests into the API server. The bootstrap manifests specify the Kubernetes join token and kubelet CSR auto-approval. The kubelet service on every node is now able to obtain a client certificate for itself and register its node in the API server.

Other bootstrap manifests specify additional resources critical for Kubernetes operations (e.g. CNI, PSP, etc.).

The etcd service on non-bootstrap nodes is now able to discover other members of the etcd cluster via the Kubernetes Endpoints resource. The etcd cluster is now formed and consists of all control plane nodes.

All control plane nodes render static pod manifests for the control plane components. Each node now runs a full set of components to make the control plane HA.

The kubelet service on worker nodes is now able to issue the client certificate and register itself with the API server.

Scaling Up the Control Plane

When new nodes are added to the control plane, the process is the same as the bootstrap process above: the etcd service discovers existing members of the control plane via the control plane endpoint, joins the etcd cluster, and the control plane components are scheduled on the node.

Scaling Down the Control Plane

Scaling down the control plane involves removing a node from the cluster. The most critical part is making sure that the node which is being removed leaves the etcd cluster. When using talosctl reset command, the targeted control plane node leaves the etcd cluster as part of the reset sequence.
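
For example (the node address is illustrative):

talosctl -n 10.0.0.4 reset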

Upgrading Control Plane Nodes

When a control plane node is upgraded, Talos leaves etcd, wipes the system disk, installs a new version of itself, and reboots. The upgraded node then joins the etcd cluster on reboot. So upgrading a control plane node is equivalent to scaling down the control plane node followed by scaling up with a new version of Talos.

9.8 - Controllers and Resources

Talos implements concepts of resources and controllers to facilitate internal operations of the operating system. Talos resources and controllers are very similar to Kubernetes resources and controllers, but there are some differences. The content of this document is not required to operate Talos, but it is useful for troubleshooting.

Starting with Talos 0.9, most of the Kubernetes control plane bootstrapping and operations are implemented via controllers and resources, which allows Talos to be reactive to configuration changes and environment changes (e.g. time sync).

Resources

A resource captures a piece of system state. Each resource belongs to a “Type” which defines the resource contents. Resource state can be split into two parts:

  • metadata: fixed set of fields describing resource - namespace, type, ID, etc.
  • spec: contents of the resource (depends on resource type).

A resource is uniquely identified by (namespace, type, id). Namespaces provide a way to avoid conflicts on duplicate resource IDs.

As of this writing, all resources are local to the node and stored in memory, so on every reboot the resource state is rebuilt from scratch (the only exception is the MachineConfig resource, which reflects the current machine config).

Controllers

Controllers run as independent lightweight threads in Talos. The goal of the controller is to reconcile the state based on inputs and eventually update outputs.

A controller can have any number of resource types (and namespaces) as inputs. In other words, it watches specified resources for changes and reconciles when these changes occur. A controller might also have additional inputs: running reconcile on schedule, watching etcd keys, etc.

A controller has a single output: a set of resources of fixed type in a fixed namespace. Only one controller can manage a given resource type in a given namespace, so conflicts are avoided.

Querying Resources

The Talos CLI tool talosctl provides read-only access to the resource API, which includes getting a specific resource, listing resources, and watching for changes.

Talos stores resources describing resource types and namespaces in the meta namespace:

$ talosctl get resourcedefinitions
NODE         NAMESPACE   TYPE                 ID                                               VERSION
172.20.0.2   meta        ResourceDefinition   bootstrapstatuses.v1alpha1.talos.dev             1
172.20.0.2   meta        ResourceDefinition   etcdsecrets.secrets.talos.dev                    1
172.20.0.2   meta        ResourceDefinition   kubernetescontrolplaneconfigs.config.talos.dev   1
172.20.0.2   meta        ResourceDefinition   kubernetessecrets.secrets.talos.dev              1
172.20.0.2   meta        ResourceDefinition   machineconfigs.config.talos.dev                  1
172.20.0.2   meta        ResourceDefinition   machinetypes.config.talos.dev                    1
172.20.0.2   meta        ResourceDefinition   manifests.kubernetes.talos.dev                   1
172.20.0.2   meta        ResourceDefinition   manifeststatuses.kubernetes.talos.dev            1
172.20.0.2   meta        ResourceDefinition   namespaces.meta.cosi.dev                         1
172.20.0.2   meta        ResourceDefinition   resourcedefinitions.meta.cosi.dev                1
172.20.0.2   meta        ResourceDefinition   rootsecrets.secrets.talos.dev                    1
172.20.0.2   meta        ResourceDefinition   secretstatuses.kubernetes.talos.dev              1
172.20.0.2   meta        ResourceDefinition   services.v1alpha1.talos.dev                      1
172.20.0.2   meta        ResourceDefinition   staticpods.kubernetes.talos.dev                  1
172.20.0.2   meta        ResourceDefinition   staticpodstatuses.kubernetes.talos.dev           1
172.20.0.2   meta        ResourceDefinition   timestatuses.v1alpha1.talos.dev                  1
$ talosctl get namespaces
NODE         NAMESPACE   TYPE        ID             VERSION
172.20.0.2   meta        Namespace   config         1
172.20.0.2   meta        Namespace   controlplane   1
172.20.0.2   meta        Namespace   meta           1
172.20.0.2   meta        Namespace   runtime        1
172.20.0.2   meta        Namespace   secrets        1

Most of the time the namespace flag (--namespace) can be omitted, as the ResourceDefinition contains the default namespace, which is used if no namespace is given:

$ talosctl get resourcedefinitions resourcedefinitions.meta.cosi.dev -o yaml
node: 172.20.0.2
metadata:
    namespace: meta
    type: ResourceDefinitions.meta.cosi.dev
    id: resourcedefinitions.meta.cosi.dev
    version: 1
    phase: running
spec:
    type: ResourceDefinitions.meta.cosi.dev
    displayType: ResourceDefinition
    aliases:
        - resourcedefinitions
        - resourcedefinition
        - resourcedefinitions.meta
        - resourcedefinitions.meta.cosi
        - rd
        - rds
    printColumns: []
    defaultNamespace: meta

The resource definition also contains type aliases, which can be used interchangeably with the canonical resource name:

$ talosctl get ns config
NODE         NAMESPACE   TYPE        ID             VERSION
172.20.0.2   meta        Namespace   config         1

Output

The talosctl get command supports the following output modes:

  • table (default) prints the resource list as a table
  • yaml prints pretty-formatted resources with details, including the full metadata and spec. This format carries the most detail from the backend resource (e.g. comments in the MachineConfig resource)
  • json prints the same information as yaml, but some additional details (e.g. comments) might be lost. This format is useful for automated processing with tools like jq (see the example below).
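
For example, the JSON output can be piped into jq for scripting. A minimal sketch (assuming the JSON documents mirror the YAML structure shown in the examples below) that prints only the resource IDs:

talosctl get services -o json | jq -r '.metadata.id'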

Watching Changes

If the --watch flag is appended to the talosctl get command, the command switches to watch mode. If a list of resources was requested, talosctl prints the initial contents of the list and then appends resource information for every change:

$ talosctl get svc -w
NODE         *   NAMESPACE   TYPE      ID       VERSION   RUNNING   HEALTHY
172.20.0.2   +   runtime     Service   timed    2         true      true
172.20.0.2   +   runtime     Service   trustd   2         true      true
172.20.0.2   +   runtime     Service   udevd    2         true      true
172.20.0.2   -   runtime     Service   timed    2         true      true
172.20.0.2   +   runtime     Service   timed    1         true      false
172.20.0.2       runtime     Service   timed    2         true      true

The * column specifies the event type:

  • + is created
  • - is deleted
  • (no symbol) is updated

In YAML/JSON output, an event field is added to the resource representation to describe the event type.
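
For example, the same watch can be run in YAML mode to see the event field for every change (output omitted here):

talosctl get svc --watch -o yaml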

Examples

Getting machine config:

$ talosctl get machineconfig -o yaml
node: 172.20.0.2
metadata:
    namespace: config
    type: MachineConfigs.config.talos.dev
    id: v1alpha1
    version: 2
    phase: running
spec:
    version: v1alpha1 # Indicates the schema used to decode the contents.
    debug: false # Enable verbose logging to the console.
    persist: true # Indicates whether to pull the machine config upon every boot.
    # Provides machine specific configuration options.
...

Getting control plane static pod statuses:

$ talosctl get staticpodstatus
NODE         NAMESPACE      TYPE              ID                                                           VERSION   READY
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-apiserver-talos-default-master-1            3         True
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-controller-manager-talos-default-master-1   3         True
172.20.0.2   controlplane   StaticPodStatus   kube-system/kube-scheduler-talos-default-master-1            4         True

Getting static pod definition for kube-apiserver:

$ talosctl get sp kube-apiserver -n 172.20.0.2 -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: StaticPods.kubernetes.talos.dev
    id: kube-apiserver
    version: 3
    phase: running
    finalizers:
        - k8s.StaticPodStatus("kube-apiserver")
spec:
    apiVersion: v1
    kind: Pod
    metadata:
        annotations:
            talos.dev/config-version: "1"
            talos.dev/secrets-version: "2"
...

Inspecting Controller Dependencies

Talos can report current dependencies between controllers and resources for debugging purposes:

$ talosctl inspect dependencies
digraph  {

  n1[label="config.K8sControlPlaneController",shape="box"];
  n3[label="config.MachineTypeController",shape="box"];
  n2[fillcolor="azure2",label="config:KubernetesControlPlaneConfigs.config.talos.dev",shape="note",style="filled"];
...

This outputs the graph in graphviz format, which can be rendered to PNG with the following command:

talosctl inspect dependencies | dot -T png > deps.png

Controller Dependencies

The graph can be enhanced by replacing resource types with actual resource instances:

talosctl inspect dependencies --with-resources | dot -T png > deps.png

Controller Dependencies with Resources

9.9 - Networking Resources

Starting with version 0.11, a new implementation of the network configuration subsystem is powered by COSI. The new implementation is still using the same machine configuration file format and external sources to configure a node’s network, so there should be no difference in the way Talos works in 0.11.

The most notable change in Talos 0.11 is that all changes to machine configuration .machine.network can be applied now in immediate mode (without a reboot) via talosctl edit mc --immediate or talosctl apply-config --immediate.

Resources

There are six basic network configuration items in Talos:

  • Address (IP address assigned to the interface/link);
  • Route (route to a destination);
  • Link (network interface/link configuration);
  • Resolver (list of DNS servers);
  • Hostname (node hostname and domainname);
  • TimeServer (list of NTP servers).

Each network configuration item has two counterparts:

  • *Status (e.g. LinkStatus) describes the current state of the system (Linux kernel state);
  • *Spec (e.g. LinkSpec) defines the desired configuration.

Resource     Status             Spec
Address      AddressStatus      AddressSpec
Route        RouteStatus        RouteSpec
Link         LinkStatus         LinkSpec
Resolver     ResolverStatus     ResolverSpec
Hostname     HostnameStatus     HostnameSpec
TimeServer   TimeServerStatus   TimeServerSpec

Status resources have aliases with the Status suffix removed, so for example AddressStatus is also available as Address.

Talos networking controllers reconcile the state so that *Status equals the desired *Spec.

Observing State

The current network configuration state can be observed by querying *Status resources via talosctl:

$ talosctl get addresses
NODE         NAMESPACE   TYPE            ID                                       VERSION   ADDRESS                        LINK
172.20.0.2   network     AddressStatus   eth0/172.20.0.2/24                       1         172.20.0.2/24                  eth0
172.20.0.2   network     AddressStatus   eth0/fe80::9804:17ff:fe9d:3058/64        2         fe80::9804:17ff:fe9d:3058/64   eth0
172.20.0.2   network     AddressStatus   flannel.1/10.244.4.0/32                  1         10.244.4.0/32                  flannel.1
172.20.0.2   network     AddressStatus   flannel.1/fe80::10b5:44ff:fe62:6fb8/64   2         fe80::10b5:44ff:fe62:6fb8/64   flannel.1
172.20.0.2   network     AddressStatus   lo/127.0.0.1/8                           1         127.0.0.1/8                    lo
172.20.0.2   network     AddressStatus   lo/::1/128                               1         ::1/128                        lo

In the output there are addresses set up by Talos (e.g. eth0/172.20.0.2/24) and addresses set up by other facilities (e.g. flannel.1/10.244.4.0/32 set up by CNI).

Talos networking controllers watch the kernel state and update resources accordingly.

Additional details about the address can be accessed via the YAML output:

$ talosctl get address eth0/172.20.0.2/24 -o yaml
node: 172.20.0.2
metadata:
    namespace: network
    type: AddressStatuses.net.talos.dev
    id: eth0/172.20.0.2/24
    version: 1
    owner: network.AddressStatusController
    phase: running
    created: 2021-06-29T20:23:18Z
    updated: 2021-06-29T20:23:18Z
spec:
    address: 172.20.0.2/24
    local: 172.20.0.2
    broadcast: 172.20.0.255
    linkIndex: 4
    linkName: eth0
    family: inet4
    scope: global
    flags: permanent

Resources can be watched for changes with the --watch flag to see how configuration changes over time.
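
For example, to follow address changes as they happen (e.g. while a DHCP lease is being renewed):

talosctl get addresses --watch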

Other networking status resources can be inspected with talosctl get routes, talosctl get links, etc. For example:

$ talosctl get resolvers
NODE         NAMESPACE   TYPE             ID          VERSION   RESOLVERS
172.20.0.2   network     ResolverStatus   resolvers   2         ["8.8.8.8","1.1.1.1"]

Inspecting Configuration

The desired networking configuration is combined from multiple sources and presented as *Spec resources:

$ talosctl get addressspecs
NODE         NAMESPACE   TYPE          ID                   VERSION
172.20.0.2   network     AddressSpec   eth0/172.20.0.2/24   2
172.20.0.2   network     AddressSpec   lo/127.0.0.1/8       2
172.20.0.2   network     AddressSpec   lo/::1/128           2

These AddressSpecs are applied to the Linux kernel to reach the desired state. If, for example, an AddressSpec is removed, the address is removed from the Linux network interface as well.

*Spec resources can’t be manipulated directly; they are generated automatically by Talos from multiple configuration sources (see the section below for details).

If a *Spec resource is queried in YAML format, some additional information is available:

$ talosctl get addressspecs eth0/172.20.0.2/24 -o yaml
node: 172.20.0.2
metadata:
    namespace: network
    type: AddressSpecs.net.talos.dev
    id: eth0/172.20.0.2/24
    version: 2
    owner: network.AddressMergeController
    phase: running
    created: 2021-06-29T20:23:18Z
    updated: 2021-06-29T20:23:18Z
    finalizers:
        - network.AddressSpecController
spec:
    address: 172.20.0.2/24
    linkName: eth0
    family: inet4
    scope: global
    flags: permanent
    layer: operator

An important field is the layer field, which describes the configuration layer this spec comes from: in this case, it is generated by a network operator (see below), namely the DHCPv4 operator.

Configuration Merging

The Spec resources described in the previous section show the final merged configuration state, while the initial specs are put into a different, unmerged namespace, network-config. Spec resources in the network-config namespace are merged with conflict resolution to produce the final merged representation in the network namespace.

Let’s take HostnameSpec as an example. The final merged representation is:

$ talosctl get hostnamespec -o yaml
node: 172.20.0.2
metadata:
    namespace: network
    type: HostnameSpecs.net.talos.dev
    id: hostname
    version: 2
    owner: network.HostnameMergeController
    phase: running
    created: 2021-06-29T20:23:18Z
    updated: 2021-06-29T20:23:18Z
    finalizers:
        - network.HostnameSpecController
spec:
    hostname: talos-default-master-1
    domainname: ""
    layer: operator

We can see that the final configuration for the hostname is talos-default-master-1, and this is the hostname that was actually applied. This can be verified by querying the HostnameStatus resource:

$ talosctl get hostnamestatus
NODE         NAMESPACE   TYPE             ID         VERSION   HOSTNAME                 DOMAINNAME
172.20.0.2   network     HostnameStatus   hostname   1         talos-default-master-1

Initial configuration for the hostname in the network-config namespace is:

$ talosctl get hostnamespec -o yaml --namespace network-config
node: 172.20.0.2
metadata:
    namespace: network-config
    type: HostnameSpecs.net.talos.dev
    id: default/hostname
    version: 2
    owner: network.HostnameConfigController
    phase: running
    created: 2021-06-29T20:23:18Z
    updated: 2021-06-29T20:23:18Z
spec:
    hostname: talos-172-20-0-2
    domainname: ""
    layer: default
---
node: 172.20.0.2
metadata:
    namespace: network-config
    type: HostnameSpecs.net.talos.dev
    id: dhcp4/eth0/hostname
    version: 1
    owner: network.OperatorSpecController
    phase: running
    created: 2021-06-29T20:23:18Z
    updated: 2021-06-29T20:23:18Z
spec:
    hostname: talos-default-master-1
    domainname: ""
    layer: operator

We can see that there are two specs for the hostname:

  • one from the default configuration layer which defines the hostname as talos-172-20-0-2 (default driven by the default node address);
  • another one from the layer operator that defines the hostname as talos-default-master-1 (DHCP).

Talos merges these two specs into a final HostnameSpec based on the configuration layer and merge rules. Here is the order of precedence from low to high:

  • default (defaults provided by Talos);
  • cmdline (from the kernel command line);
  • platform (driven by the cloud provider);
  • operator (various dynamic configuration options: DHCP, Virtual IP, etc);
  • configuration (derived from the machine configuration).

So in our example the operator layer HostnameSpec overwrites the default layer producing the final hostname talos-default-master-1.

The merge process applies to all six core networking specs. For each spec, the layer controls the merge behavior. If multiple configuration specs appear at the same layer, they are merged together if possible; otherwise the merge result is stable but not defined (e.g. if DHCP on multiple interfaces provides two different hostnames for the node).

LinkSpecs are merged across layers, so, for example, the machine configuration setting for an interface MTU overrides the MTU received from the DHCP server.
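
As an illustration, a machine configuration fragment like the following (a sketch with illustrative values, not a complete configuration) pins the MTU for eth0 at the configuration layer, which takes precedence over the MTU offered by DHCP:

machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: true
        mtu: 9000 # the configuration layer wins over the operator (DHCP) layer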

Network Operators

Network operators provide dynamic network configuration which can change over time as the node is running:

  • DHCPv4
  • DHCPv6
  • Virtual IP

Network operators produce specs for addresses, routes, links, etc., which are then merged and applied according to the rules described above.

Operators are configured with OperatorSpec resources, which describe when operators should run and provide additional configuration for the operator:

$ talosctl get operatorspecs -o yaml
node: 172.20.0.2
metadata:
    namespace: network
    type: OperatorSpecs.net.talos.dev
    id: dhcp4/eth0
    version: 1
    owner: network.OperatorConfigController
    phase: running
    created: 2021-06-29T20:23:18Z
    updated: 2021-06-29T20:23:18Z
spec:
    operator: dhcp4
    linkName: eth0
    requireUp: true
    dhcp4:
        routeMetric: 1024

OperatorSpec resources are generated by Talos, mostly based on the machine configuration. The DHCP4 operator is created automatically for all physical network links which are not configured explicitly via the kernel command line or the machine configuration (see the example below). This also means that on the first boot, without a machine configuration, a DHCP request is made on all physical network interfaces by default.
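
For example, explicitly configuring a link in the machine configuration (a sketch with illustrative addresses) means the DHCP4 operator is not started automatically for it:

machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        addresses:
          - 172.20.0.2/24
        routes:
          - network: 0.0.0.0/0
            gateway: 172.20.0.1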

Specs generated by operators are prefixed with the operator ID (dhcp4/eth0 in the example above) in the unmerged network-config namespace:

$ talosctl -n 172.20.0.2 get addressspecs --namespace network-config
NODE         NAMESPACE        TYPE          ID                              VERSION
172.20.0.2   network-config   AddressSpec   dhcp4/eth0/eth0/172.20.0.2/24   1

Other Network Resources

There are some additional resources describing the network subsystem state.

The NodeAddress resource presents node addresses excluding link-local and loopback addresses:

$ talosctl get nodeaddresses
NODE          NAMESPACE   TYPE          ID             VERSION   ADDRESSES
10.100.2.23   network     NodeAddress   accumulative   6         ["10.100.2.23","147.75.98.173","147.75.195.143","192.168.95.64","2604:1380:1:ca00::17"]
10.100.2.23   network     NodeAddress   current        5         ["10.100.2.23","147.75.98.173","192.168.95.64","2604:1380:1:ca00::17"]
10.100.2.23   network     NodeAddress   default        1         ["10.100.2.23"]

  • default is the node default address;
  • current is the set of addresses a node currently has;
  • accumulative is the set of addresses a node had over time (it might include virtual IPs which are not owned by the node at the moment).

NodeAddress resources are used to pick the default address for the etcd peer URL, to populate the SANs field in the generated certificates, etc.

Another important resource is Nodename, which provides the Node name in Kubernetes:

$ talosctl get nodename
NODE          NAMESPACE      TYPE       ID         VERSION   NODENAME
10.100.2.23   controlplane   Nodename   nodename   1         infra-green-cp-mmf7v

Depending on the machine configuration, the nodename might be just the hostname or the FQDN of the node.

NetworkStatus aggregates the current state of the network configuration:

$ talosctl get networkstatus -o yaml
node: 10.100.2.23
metadata:
    namespace: network
    type: NetworkStatuses.net.talos.dev
    id: status
    version: 5
    owner: network.StatusController
    phase: running
    created: 2021-06-24T18:56:00Z
    updated: 2021-06-24T18:56:02Z
spec:
    addressReady: true
    connectivityReady: true
    hostnameReady: true
    etcFilesReady: true

Network Controllers

For each of the six basic resource types, there are several controllers:

  • *StatusController populates *Status resources observing the Linux kernel state.
  • *ConfigController produces the initial unmerged *Spec resources in the network-config namespace based on defaults, kernel command line, and machine configuration.
  • *MergeController merges *Spec resources into the final representation in the network namespace.
  • *SpecController applies merged *Spec resources to the kernel state.

For the network operators:

  • OperatorConfigController produces OperatorSpec resources based on the machine configuration and defaults.
  • OperatorSpecController runs network operators watching OperatorSpec resources and producing various *Spec resources in the network-config namespace.

Configuration Sources

There are several configuration sources for the network configuration, which are described in this section.

Defaults

  • the lo interface is assigned the addresses 127.0.0.1/8 and ::1/128;
  • the hostname is set to talos-<IP>, where IP is the default node address;
  • resolvers are set to 8.8.8.8, 1.1.1.1;
  • time servers are set to pool.ntp.org;
  • the DHCP4 operator is run on any physical interface which is not configured explicitly.

Cmdline

The kernel command line is parsed for the following options (an example follows the list):

  • the ip= option is parsed for the node IP, default gateway, hostname, DNS servers, and NTP servers;
  • the talos.hostname= option is used to set the node hostname;
  • talos.network.interface.ignore= can be used to make Talos skip network interface configuration completely.
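
For example, a kernel command line combining these options might look like the following (illustrative values; the ip= option follows the standard Linux kernel syntax):

ip=172.20.0.2::172.20.0.1:255.255.255.0::eth0:off talos.hostname=talos-master-1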

Platform

Platform configuration delivers cloud environment-specific options (e.g. the hostname).

Operator

Network operators provide configuration for all basic resource types.

Machine Configuration

The machine configuration is parsed for link configuration, addresses, routes, hostname, resolvers, and time servers. Any changes to the .machine.network configuration can be applied in immediate mode.
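
For example, a change to .machine.network can be applied without a reboot using the commands mentioned earlier (a sketch; the node address and file name are illustrative):

talosctl -n 172.20.0.2 edit mc --immediate
talosctl -n 172.20.0.2 apply-config --immediate --file controlplane.yaml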

Network Configuration Debugging

Most of the network controller operations and failures are logged to the kernel console; additional debug-level logs are available with the talosctl logs controller-runtime command. If the network configuration can’t be established and the API is not available, debug-level logs can be sent to the console with the debug: true option in the machine configuration.
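
For example, to stream these logs from a specific node (the -f flag follows the log output):

talosctl -n 172.20.0.2 logs controller-runtime -f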

9.10 - Discovery

We maintain a public discovery service whereby members of your cluster can use a common and unique key to coordinate the most basic connection information (i.e. the set of possible “endpoints”, or IP:port pairs). We call this data “affiliate data.”

Note: If KubeSpan is enabled, the data also includes the WireGuard public key.

Before sending data to the discovery service, Talos will encrypt the affiliate data with AES-GCM encryption and separately encrypt endpoints with AES in ECB mode so that endpoints coming from different sources can be deduplicated server-side. Each node submits its own data encrypted, plus the endpoints it sees from other peers, to the discovery service. The discovery service aggregates the data, deduplicates the endpoints, and sends updates to each connected peer. Each peer receives information back about other affiliates from the discovery service, decrypts it, and uses it to drive KubeSpan and cluster discovery. Moreover, the discovery service has no persistence: data is stored in memory only, with a TTL set by the clients (i.e. Talos). The cluster ID is used as a key to select the affiliates (so that different clusters see different affiliates).

To summarize, the discovery service knows the client version, cluster ID, the number of affiliates, some encrypted data for each affiliate, and a list of encrypted endpoints.

9.11 - KubeSpan

WireGuard Peer Discovery

The key pieces of information needed for WireGuard generally are:

  • the public key of the host you wish to connect to
  • an IP address and port of the host you wish to connect to

The latter is really only required of one side of the pair. Once traffic is received, that information is known and updated by WireGuard automatically and internally.

For Kubernetes, though, this is not quite sufficient. Kubernetes also needs to know which traffic goes to which WireGuard peer. Because this information may be dynamic, we need a way to be able to constantly keep this information up to date.

If we have a functional connection to Kubernetes otherwise, it’s fairly easy: we can just keep that information in Kubernetes. Otherwise, we have to have some way to discover it.

In our solution, we have a multi-tiered approach to gathering this information. Each tier can operate independently, but the amalgamation of the tiers produces a more robust set of connection criteria.

For this discussion, we will point out two of these tiers:

  • an external service
  • a Kubernetes-based system

See discovery service to learn more about the external service.

The Kubernetes-based system utilises annotations on Kubernetes Nodes which describe each node’s public key and local addresses.

On top of this, we also route Pod subnets. This is often (maybe even usually) taken care of by the CNI, but there are many situations where the CNI fails to be able to do this itself, across networks. So we also scrape the Kubernetes Node resource to discover its podCIDRs.

NAT, Multiple Routes, Multiple IPs

One of the difficulties in communicating across networks is that there is often not a single address and port which can identify a connection for each node on the system. For instance, a node sitting on the same network might see its peer as 192.168.2.10, but a node across the internet may see it as 2001:db8:1ef1::10.

We need to be able to handle any number of addresses and ports, and we also need to have a mechanism to try them. WireGuard only allows us to select one at a time.

For our implementation, then, we have built a controller which continuously discovers and rotates these IP:port pairs until a connection is established. It then starts trying again if that connection ever fails.

Packet Routing

After we have established a WireGuard connection, our work is not done. We still have to make sure that the right packets get sent to the WireGuard interface.

WireGuard supplies a convenient facility for tagging packets which come from it, which is great. But in our case, we need to be able to allow traffic which both does not come from WireGuard and also is not destined for another Kubernetes node to flow through the normal mechanisms.

Unlike many corporate or privacy-oriented VPNs, we need to allow general internet traffic to flow normally.

Also, as our cluster grows, this set of IP addresses can become quite large and quite dynamic. This would be very cumbersome and slow in iptables. Luckily, the kernel supplies a convenient mechanism by which to define this arbitrarily large set of IP addresses: IP sets.

Talos collects all of the IPs and subnets which are considered “in-cluster” and maintains these in the kernel as an IP set.

Now that we have the IP set defined, we need to tell the kernel how to use it.

The traditional way of doing this would be to use iptables. However, there is a big problem with IPTables. It is a common namespace in which any number of other pieces of software may dump things. We have no surety that what we add will not be wiped out by something else (from Kubernetes itself, to the CNI, to some workload application), be rendered unusable by higher-priority rules, or just generally cause trouble and conflicts.

Instead, we use a three-pronged system which is both more foundational and less centralised.

NFTables offers a separately namespaced, decentralised way of marking packets for later processing based on IP sets. Instead of a common set of well-known tables, NFTables uses hooks into the kernel’s netfilter system, which are less vulnerable to being usurped, bypassed, or a source of interference than IPTables, but which are rendered down by the kernel to the same underlying XTables system.

Our NFTables system is where we store the IP sets. Any packet which enters the system, either by forward from inside Kubernetes or by generation from the host itself, is compared against a hash table of this IP set. If it is matched, it is marked for later processing by our next stage. This is a high-performance system which exists fully in the kernel and which ultimately becomes an eBPF program, so it scales well to hundreds of nodes.

The next stage is the kernel router’s route rules. These are defined as a common ordered list of operations for the whole operating system, but they are intended to be tightly constrained and are rarely used by applications in any case. The rules we add are very simple: if a packet is marked by our NFTables system, send it to an alternate routing table.

This leads us to our third and final stage of packet routing. We have a custom routing table with two rules:

  • send all IPv4 traffic to the WireGuard interface
  • send all IPv6 traffic to the WireGuard interface

So in summary, we:

  • mark packets destined for Kubernetes applications or Kubernetes nodes
  • send marked packets to a special routing table
  • send anything which is sent to that routing table through the WireGuard interface

This gives us an isolated, resilient, tolerant, and non-invasive way to route Kubernetes traffic safely, automatically, and transparently through WireGuard across almost any set of network topologies.

9.12 - Developing Talos

This guide outlines steps and tricks to develop the Talos operating system and related components. The guide assumes a Linux operating system on the development host. Some steps might work under Mac OS X, but using Linux is highly advised.

Prepare

Check out the Talos repository.

Try running make help to see the available make commands. You will need Docker and buildx installed on the host.

Note: Usually it is better to install an up-to-date Docker from the Docker apt repositories, e.g. the Ubuntu instructions.

If the buildx plugin is not available with the OS Docker packages, it can be installed as a plugin from GitHub releases.

Set up a builder with access to the host network:

docker buildx create --driver docker-container --driver-opt network=host --name local1 --buildkitd-flags '--allow-insecure-entitlement security.insecure' --use

Note: network=host allows the buildx builder to access the host network, so that it can push to a local container registry (see below).

Make sure the following steps work:

  • make talosctl
  • make initramfs kernel

Set up a local docker registry:

docker run -d -p 5005:5000 \
    --restart always \
    --name local registry:2

Try to build an installer image and push it to the local registry:

make installer IMAGE_REGISTRY=127.0.0.1:5005 PUSH=true

Record the image name output in the step above.

Note: it is also possible to force a stable image tag by using the TAG variable: make installer IMAGE_REGISTRY=127.0.0.1:5005 TAG=v1.0.0-beta.1 PUSH=true.

Running Talos cluster

Set up local caching Docker registries (this speeds up Talos cluster boot a lot); the script is in the Talos repo:

bash hack/start-registry-proxies.sh

Start your local cluster with:

sudo -E _out/talosctl-linux-amd64 cluster create \
    --provisioner=qemu \
    --cidr=172.20.0.0/24 \
    --registry-mirror docker.io=http://172.20.0.1:5000 \
    --registry-mirror k8s.gcr.io=http://172.20.0.1:5001  \
    --registry-mirror quay.io=http://172.20.0.1:5002 \
    --registry-mirror gcr.io=http://172.20.0.1:5003 \
    --registry-mirror ghcr.io=http://172.20.0.1:5004 \
    --registry-mirror 127.0.0.1:5005=http://172.20.0.1:5005 \
    --install-image=127.0.0.1:5005/talos-systems/installer:<RECORDED HASH from the build step> \
    --masters 3 \
    --workers 2 \
    --with-bootloader=false

  • --provisioner selects QEMU vs. the default Docker
  • custom --cidr makes the QEMU cluster use a different network than the default Docker setup (optional)
  • --registry-mirror uses the caching proxies set up above to speed up boot time a lot; the last one adds your local registry (the installer image was pushed to it)
  • --install-image is the image you built with make installer above
  • --masters & --workers configure the cluster size; choose to match your resources. 3 masters give you an HA control plane; 1 master is enough, but never use 2 masters
  • --with-bootloader=false disables boot from disk (Talos will always boot from _out/vmlinuz-amd64 and _out/initramfs-amd64.xz). This speeds up the development cycle a lot: there is no need to rebuild the installer and perform an install, as rebooting is enough to get new code.

Note: as the boot loader is not used, it is not necessary to rebuild the installer each time (an old image is fine), but sometimes it is needed (when configuration changes are made and the old installer doesn’t validate the new config).

talosctl cluster create derives the Talos machine configuration version from the install image tag, so sometimes early in the development cycle (when the new minor tag is not released yet), the machine config version can be overridden with --talos-version=v0.14.

If the --with-bootloader=false flag is not enabled, a Talos upgrade is required for the cluster to pick up new changes to the code (in the initramfs), so a new installer should be built. With the --with-bootloader=false flag, Talos always boots from the initramfs in the _out/ directory, so a simple reboot is enough to pick up new code changes.

If the installation flow needs to be tested, --with-bootloader=false shouldn’t be used.

Console Logs

Watching console logs is easy with tail:

tail -F ~/.talos/clusters/talos-default/talos-default-*.log

Interacting with Talos

Once talosctl cluster create finishes successfully, talosconfig and kubeconfig will be set up automatically to point to your cluster.

Start playing with talosctl:

talosctl -n 172.20.0.2 version
talosctl -n 172.20.0.3,172.20.0.4 dashboard
talosctl -n 172.20.0.4 get members

Same with kubectl:

kubectl get nodes -o wide

You can deploy some Kubernetes workloads to the cluster.

You can edit the machine config on the fly with talosctl edit mc --immediate; config patches can be applied via --config-patch flags; many features also have specific flags in talosctl cluster create.

Quick Reboot

To reboot the whole cluster quickly (e.g. to pick up a change made in the code):

for socket in ~/.talos/clusters/talos-default/talos-default-*.monitor; do echo "q" | sudo socat - unix-connect:$socket; done

Sending q to a single socket reboots a single node.

Note: This command performs an immediate reboot (as if the machine was powered down and immediately powered back up); for a normal Talos reboot, use talosctl reboot.

Development Cycle

Fast development cycle:

  • bring up a cluster
  • make code changes
  • rebuild initramfs with make initramfs
  • reboot a node to pick up the new initramfs
  • verify code changes
  • more code changes…

Some aspects of Talos development require enabling the bootloader (when working on the installer itself); in that case the quick development cycle is no longer possible, and the cluster should be destroyed and recreated each time.

Running Integration Tests

If integration tests were changed (or when running them for the first time), first rebuild the integration test binary:

rm -f  _out/integration-test-linux-amd64; make _out/integration-test-linux-amd64

Running short tests against a QEMU-provisioned cluster:

_out/integration-test-linux-amd64 \
    -talos.provisioner=qemu \
    -test.v \
    -talos.crashdump=false \
    -test.short \
    -talos.talosctlpath=$PWD/_out/talosctl-linux-amd64

The whole test suite can be run by removing the -test.short flag.

Specific tests can be run with -test.run=TestIntegration/api.ResetSuite (see the example below).
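
For example, combining the flags above to run a single suite against a QEMU-provisioned cluster (a sketch):

_out/integration-test-linux-amd64 \
    -talos.provisioner=qemu \
    -test.v \
    -talos.crashdump=false \
    -test.run=TestIntegration/api.ResetSuite \
    -talos.talosctlpath=$PWD/_out/talosctl-linux-amd64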

Build Flavors

make <something> WITH_RACE=1 enables the Go race detector; Talos runs slower and uses more memory, but data races are detected.

make <something> WITH_DEBUG=1 enables Go profiling and other debug features, useful for local development.

Destroying Cluster

sudo -E ../talos/_out/talosctl-linux-amd64 cluster destroy --provisioner=qemu

This command stops QEMU and the helper processes, tears down the bridged network on the host, and cleans up the cluster state in ~/.talos/clusters.

Note: if the host machine is rebooted, QEMU instances and helper processes won’t be restarted. In that case it is necessary to clean up the files in the ~/.talos/clusters/<cluster-name> directory manually.

Optional

Set up cross-build environment with:

docker run --rm --privileged multiarch/qemu-user-static --reset -p yes

Note: the static qemu binaries which come with Ubuntu 21.10 seem to be broken.

Unit tests

Unit tests can be run in buildx with make unit-tests; on Ubuntu systems some tests using loop devices will fail because Ubuntu uses low-index loop devices for snaps.

Most of the unit-tests can be run standalone as well, with regular go test, or using IDE integration:

go test -v ./internal/pkg/circular/

This provides a much faster feedback loop, but some tests require either elevated privileges (running as root) or additional binaries available only in the Talos rootfs (containerd tests).

Running tests as root can be done with the -exec flag to go test, but this is risky, as the test code has root access and can potentially make undesired changes:

go test -exec sudo -v ./internal/app/machined/pkg/controllers/network/...

Go Profiling

Build initramfs with debug enabled: make initramfs WITH_DEBUG=1.

Launch a Talos cluster with the bootloader disabled, and use go tool pprof to capture the profile and show the output in your browser:

go tool pprof http://172.20.0.2:9982/debug/pprof/heap

The IP address 172.20.0.2 is the address of the Talos node, and the port (:9982) depends on the Go application to profile:

  • 9981: apid
  • 9982: machined
  • 9983: trustd
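
Other standard pprof endpoints should also be reachable on the same ports (assuming the usual net/http/pprof handlers are exposed), e.g. capturing a CPU profile of machined:

go tool pprof http://172.20.0.2:9982/debug/pprof/profile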