Guides
- 1: Advanced Networking
- 2: Air-gapped Environments
- 3: Configuring Containerd
- 4: Configuring Corporate Proxies
- 5: Configuring Pull Through Cache
- 6: Configuring the Cluster Endpoint
- 7: Customizing the Kernel
- 8: Customizing the Root Filesystem
- 9: Managing PKI
- 10: Resetting a Machine
- 11: Upgrading Kubernetes
- 12: Upgrading Talos
1 - Advanced Networking
Static Addressing
Static addressing is comprised of specifying `cidr`, `routes` (remember to add your default gateway), and `interface`.
Most likely you'll also want to define the `nameservers` so you have properly functioning DNS.
machine:
  network:
    hostname: talos
    nameservers:
      - 10.0.0.1
    interfaces:
      - interface: eth0
        cidr: 10.0.0.201/8
        mtu: 8765
        routes:
          - network: 0.0.0.0/0
            gateway: 10.0.0.1
      - interface: eth1
        ignore: true
  time:
    servers:
      - time.cloudflare.com
Additional Addresses for an Interface
In some environments you may need to set additional addresses on an interface. In the following example, we set two additional addresses on the loopback interface.
machine:
  network:
    interfaces:
      - interface: lo0
        cidr: 192.168.0.21/24
      - interface: lo0
        cidr: 10.2.2.2/24
Bonding
The following example shows how to create a bonded interface.
machine:
  network:
    interfaces:
      - interface: bond0
        dhcp: true
        bond:
          mode: 802.3ad
          lacpRate: fast
          xmitHashPolicy: layer3+4
          miimon: 100
          updelay: 200
          downdelay: 200
          interfaces:
            - eth0
            - eth1
VLANs
To set up VLANs on a specific device, use an array of VLANs to add. The master device may be configured without addressing by setting `dhcp` to false.
machine:
  network:
    interfaces:
      - interface: eth0
        dhcp: false
        vlans:
          - vlanId: 100
            cidr: "192.168.2.10/28"
            routes:
              - network: 0.0.0.0/0
                gateway: 192.168.2.1
2 - Air-gapped Environments
In this guide we will create a Talos cluster running in an air-gapped environment with all the required images being pulled from an internal registry.
We will use the QEMU provisioner available in `talosctl` to create a local cluster, but the same approach could be used to deploy Talos in bigger air-gapped networks.
Requirements
The following are requirements for this guide:
- Docker 18.03 or greater
- Requirements for the Talos QEMU cluster
Identifying Images
In air-gapped environments, access to the public Internet is restricted, so Talos can't pull images from public Docker registries (`docker.io`, `ghcr.io`, etc.).
We need to identify the images required to install and run Talos.
The same strategy can be used for images required by custom workloads running on the cluster.
The `talosctl images` command provides a list of default images used by the Talos cluster (with default configuration settings).
To print the list of images, run:
talosctl images
This list contains images required by a default deployment of Talos. There might be additional images required for the workloads running on this cluster, and those should be added to this list.
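One way to keep track of the full list is to capture it in a file and append any workload images; the pull/tag/push loops below could then read from this file instead of calling `talosctl images` directly. A minimal sketch (the nginx image is a hypothetical placeholder for your own workload images):

talosctl images > images.txt
# hypothetical extra workload image; replace with what your cluster actually runs
echo 'docker.io/library/nginx:1.19' >> images.txt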
Preparing the Internal Registry
As access to the public registries is restricted, we have to run an internal Docker registry. In this guide, we will launch the registry on the same machine using Docker:
$ docker run -d -p 6000:5000 --restart always --name registry-airgapped registry:2
1bf09802bee1476bc463d972c686f90a64640d87dacce1ac8485585de69c91a5
This registry will be accepting connections on port 6000 on the host IPs. The registry is empty by default, so we have to fill it with the images required by Talos.
First, we pull all the images to our local Docker daemon:
$ for image in `talosctl images`; do docker pull $image; done
v0.12.0-amd64: Pulling from coreos/flannel
Digest: sha256:6d451d92c921f14bfb38196aacb6e506d4593c5b3c9d40a8b8a2506010dc3e10
...
All images are now stored in the Docker daemon store:
$ docker images
ghcr.io/talos-systems/install-cni v0.3.0-12-g90722c3 980d36ee2ee1 5 days ago 79.7MB
k8s.gcr.io/kube-proxy-amd64 v1.19.1 33c60812eab8 2 weeks ago 118MB
...
Now we need to re-tag them so that we can push them to our local registry.
We are going to replace the first component of the image name (before the first slash) with our registry endpoint `127.0.0.1:6000`:
$ for image in `talosctl images`; do \
    docker tag $image `echo $image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
  done
As the next step, we push images to the internal registry:
$ for image in `talosctl images`; do \
    docker push `echo $image | sed -E 's#^[^/]+/#127.0.0.1:6000/#'`; \
  done
We can now verify that the images are pushed to the registry:
$ curl http://127.0.0.1:6000/v2/_catalog
{"repositories":["autonomy/kubelet","coredns","coreos/flannel","etcd-development/etcd","kube-apiserver-amd64","kube-controller-manager-amd64","kube-proxy-amd64","kube-scheduler-amd64","talos-systems/install-cni","talos-systems/installer"]}
Note: images in the registry don’t have the registry endpoint prefix anymore.
Launching Talos in an Air-gapped Environment
For Talos to use the internal registry, we use the registry mirror feature to redirect all the image pull requests to the internal registry. This means that the registry endpoint (as the first component of the image reference) gets ignored, and all pull requests are sent directly to the specified endpoint.
We are going to use a QEMU-based Talos cluster for this guide, but the same approach works with Docker-based clusters as well. As QEMU-based clusters go through the Talos install process, they can better model a real air-gapped environment.
The `talosctl cluster create` command provides conveniences for common configuration options.
The only required flag for this guide is `--registry-mirror '*'=http://10.5.0.1:6000` which redirects every pull request to the internal registry.
The endpoint being used is `10.5.0.1`, as this is the default bridge interface address which will be routable from the QEMU VMs (the `127.0.0.1` IP would point to the VM itself).
$ sudo -E talosctl cluster create --provisioner=qemu --registry-mirror '*'=http://10.5.0.1:6000 --install-image=ghcr.io/talos-systems/installer:v0.7.0
validating CIDR and reserving IPs
generating PKI and tokens
creating state directory in "/home/smira/.talos/clusters/talos-default"
creating network talos-default
creating load balancer
creating dhcpd
creating master nodes
creating worker nodes
waiting for API
...
Note: `--install-image` should match the image which was copied into the internal registry in the previous step.
You can verify that the cluster is air-gapped by inspecting the registry logs: `docker logs -f registry-airgapped`.
Closing Notes
Running in an air-gapped environment might require additional configuration changes, for example using custom settings for DNS and NTP servers.
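For example, internal DNS and NTP servers can be set in the machine config; a minimal sketch (the 10.5.0.10 address is a placeholder for your internal server):

machine:
  network:
    nameservers:
      - 10.5.0.10
  time:
    servers:
      - 10.5.0.10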
When scaling this guide to a bare-metal environment, the following Talos config snippet could be used as an equivalent of the `--registry-mirror` flag above:
machine:
  ...
  registries:
    mirrors:
      '*':
        endpoints:
          - http://10.5.0.1:6000/
  ...
Other implementations of the Docker registry can be used in place of the Docker `registry` image used above to run the registry.
If required, auth can be configured for the internal registry (and custom TLS certificates if needed).
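As a rough sketch of what that machine config might look like, assuming the `machine.registries.config` section with `auth` and `tls` settings (the endpoint, credentials, and `insecureSkipVerify` values here are assumptions to adapt to your registry):

machine:
  registries:
    config:
      "10.5.0.1:6000":
        auth:
          username: <username>   # placeholder credentials
          password: <password>
        tls:
          insecureSkipVerify: true  # or configure a proper CA instead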
3 - Configuring Containerd
The base containerd configuration expects to merge in any additional configs present in `/var/cri/conf.d/*.toml`.
An example of exposing metrics
Into each machine config, add the following:
machine:
  ...
  files:
    - content: |
        [metrics]
          address = "0.0.0.0:11234"
      path: /var/cri/conf.d/metrics.toml
      op: create
Create the cluster as usual and see that metrics are now present on this port:
$ curl 127.0.0.1:11234/v1/metrics
# HELP container_blkio_io_service_bytes_recursive_bytes The blkio io service bytes recursive
# TYPE container_blkio_io_service_bytes_recursive_bytes gauge
container_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Async"} 0
container_blkio_io_service_bytes_recursive_bytes{container_id="0677d73196f5f4be1d408aab1c4125cf9e6c458a4bea39e590ac779709ffbe14",device="/dev/dm-0",major="253",minor="0",namespace="k8s.io",op="Discard"} 0
...
4 - Configuring Corporate Proxies
Appending the Certificate Authority of MITM Proxies
Put the PEM-encoded certificate into each machine's config:
machine:
  ...
  files:
    - content: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
      permissions: 0644
      path: /etc/ssl/certs/ca-certificates
      op: append
Configuring a Machine to Use the Proxy
To make use of a proxy:
machine:
  env:
    http_proxy: <http proxy>
    https_proxy: <https proxy>
    no_proxy: <no proxy>
Additionally, configure the DNS `nameservers` and NTP `servers`:
machine:
  env:
    ...
  time:
    servers:
      - <server 1>
      - <server ...>
      - <server n>
  ...
  network:
    nameservers:
      - <ip 1>
      - <ip ...>
      - <ip n>
5 - Configuring Pull Through Cache
In this guide we will create a set of local caching Docker registry proxies to minimize local cluster startup time.
When running Talos locally, pulling images from Docker registries might take a significant amount of time. We spin up local caching pass-through registries to cache images and configure a local Talos cluster to use those proxies. A similar approach might be used to run Talos in production in air-gapped environments. It can also be used to verify that all the images are available in local registries.
Video Walkthrough
To see a live demo of this writeup, see the video below:
Requirements
The following are requirements for creating the set of caching proxies:
Launch the Caching Docker Registry Proxies
Talos pulls from `docker.io`, `k8s.gcr.io`, `gcr.io`, `ghcr.io`, and `quay.io` by default.
If your configuration is different, you might need to modify the commands below:
docker run -d -p 5000:5000 \
-e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
--restart always \
--name registry-docker.io registry:2
docker run -d -p 5001:5000 \
-e REGISTRY_PROXY_REMOTEURL=https://k8s.gcr.io \
--restart always \
--name registry-k8s.gcr.io registry:2
docker run -d -p 5002:5000 \
-e REGISTRY_PROXY_REMOTEURL=https://quay.io \
--restart always \
--name registry-quay.io registry:2.5
docker run -d -p 5003:5000 \
-e REGISTRY_PROXY_REMOTEURL=https://gcr.io \
--restart always \
--name registry-gcr.io registry:2
docker run -d -p 5004:5000 \
-e REGISTRY_PROXY_REMOTEURL=https://ghcr.io \
--restart always \
--name registry-ghcr.io registry:2
Note: proxies are started as Docker containers, and they're automatically configured to start with the Docker daemon. Please note that the `quay.io` proxy doesn't support the recent Docker image schema, so we run an older registry image version (2.5).
As a registry container can only handle a single upstream Docker registry, we launch a container per upstream, each on its own host port (5000, 5001, 5002, 5003 and 5004).
Using Caching Registries with QEMU
Local Cluster
With a QEMU local cluster, a bridge interface is created on the host. As registry containers expose their ports on the host, we can use the bridge IP to direct proxy requests.
sudo talosctl cluster create --provisioner qemu \
--registry-mirror docker.io=http://10.5.0.1:5000 \
--registry-mirror k8s.gcr.io=http://10.5.0.1:5001 \
--registry-mirror quay.io=http://10.5.0.1:5002 \
--registry-mirror gcr.io=http://10.5.0.1:5003 \
--registry-mirror ghcr.io=http://10.5.0.1:5004
The Talos local cluster should now start pulling via caching registries.
This can be verified via the registry logs, e.g. `docker logs -f registry-docker.io`.
The first time the cluster boots, images are pulled and cached, so the next cluster boot should be much faster.
Note: `10.5.0.1` is the bridge IP with the default network (`10.5.0.0/24`); if using a custom `--cidr`, the value should be adjusted accordingly.
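As an additional check (assuming the default host ports above), the registry HTTP API can be queried directly to see which repositories have been cached so far, just as in the air-gapped guide:

curl http://127.0.0.1:5000/v2/_catalog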
Using Caching Registries with docker
Local Cluster
With a docker local cluster we can use the docker bridge IP; the default value for that IP is `172.17.0.1`.
On Linux, the docker bridge address can be inspected with `ip addr show docker0`.
talosctl cluster create --provisioner docker \
--registry-mirror docker.io=http://172.17.0.1:5000 \
--registry-mirror k8s.gcr.io=http://172.17.0.1:5001 \
--registry-mirror quay.io=http://172.17.0.1:5002 \
--registry-mirror gcr.io=http://172.17.0.1:5003 \
--registry-mirror ghcr.io=http://172.17.0.1:5004
Cleaning Up
To cleanup, run:
docker rm -f registry-docker.io
docker rm -f registry-k8s.gcr.io
docker rm -f registry-quay.io
docker rm -f registry-gcr.io
docker rm -f registry-ghcr.io
Note: Removing docker registry containers also removes the image cache. So if you plan to use caching registries, keep the containers running.
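If you need to free the host ports temporarily without losing the cache, stopping the containers instead of removing them is enough, e.g.:

docker stop registry-docker.io
docker start registry-docker.io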
6 - Configuring the Cluster Endpoint
In this section, we will step through the configuration of a Talos based Kubernetes cluster. There are three major components we will configure:
- `apid` and `talosctl`
- the master nodes
- the worker nodes
Talos enforces a high level of security by using mutual TLS for authentication and authorization.
We recommend that the configuration of Talos be performed by a cluster owner. A cluster owner should be a person of authority within an organization, perhaps a director, manager, or senior member of a team. They are responsible for storing the root CA, and distributing the PKI for authorized cluster administrators.
Recommended settings
Talos runs great out of the box, but if you tweak some minor settings it will make your life a lot easier in the future. This is not a requirement, but rather a document to explain some key settings.
Endpoint
To configure the `talosctl` endpoint, it is recommended you use a resolvable DNS name.
This way, if you decide to upgrade to a multi-controlplane cluster, you only have to add the IP address to the hostname configuration.
The configuration can either be done on a load balancer, or simply through DNS.
For example:
This is in the config file for the cluster, e.g. init.yaml, controlplane.yaml, and join.yaml. For more details, please see the v1alpha1 endpoint configuration.
.....
cluster:
  controlPlane:
    endpoint: https://endpoint.example.local:6443
.....
If you have a DNS name as the endpoint, you can upgrade your Talos cluster with multiple controlplanes in the future (if you don't have a multi-controlplane setup from the start). Using a DNS name generates the corresponding certificates (Kubernetes and Talos) for the correct hostname.
7 - Customizing the Kernel
To build an installer with a custom kernel, define a `customization` stage that copies in the new kernel modules, and copy the kernel itself into the installer image:

FROM scratch AS customization
COPY --from=<custom kernel image> /lib/modules /lib/modules

FROM docker.io/andrewrynhard/installer:latest
COPY --from=<custom kernel image> /boot/vmlinuz /usr/install/vmlinuz

Then build the installer, removing the default modules via the `RM` build-time variable:

docker build --build-arg RM="/lib/modules" -t talos-installer .
Note: You can use the `--squash` flag to create smaller images.
Now that we have a custom installer we can build Talos for the specific platform we wish to deploy to.
8 - Customizing the Root Filesystem
The installer image contains `ONBUILD` instructions that handle the following:
- the decompression and unpacking of the `initramfs.xz`
- the unsquashing of the rootfs
- the copying of new rootfs files
- the squashing of the new rootfs
- and the packing and compression of the new `initramfs.xz`

When used as a base image, the installer will perform the above steps automatically with the requirement that a `customization` stage be defined in the `Dockerfile`.
For example, say we have an image that contains the contents of a library we wish to add to the Talos rootfs.
We need to define a stage with the name `customization`:
FROM scratch AS customization
COPY --from=<name|index> <src> <dest>
Using a multi-stage `Dockerfile` we can define the `customization` stage and build `FROM` the installer image:
FROM scratch AS customization
COPY --from=<name|index> <src> <dest>
FROM ghcr.io/talos-systems/installer:latest
When building the image, the `customization` stage will automatically be copied into the rootfs.
The `customization` stage is not limited to a single `COPY` instruction.
In fact, you can do whatever you would like in this stage, but keep in mind that everything in `/` will be copied into the rootfs.
Note: `<dest>` is the path relative to the rootfs that you wish to place the contents of `<src>`.
To build the image, run:
docker build --squash -t <organization>/installer:latest .
In the case that you need to perform some cleanup before adding additional files to the rootfs, you can specify the `RM` build-time variable:
docker build --squash --build-arg RM="[<path> ...]" -t <organization>/installer:latest .
This will perform a `rm -rf` on the specified paths relative to the rootfs.
Note: `RM` must be a whitespace delimited list.
The resulting image can be used to:
- generate an image for any of the supported providers
- perform bare-metal installs
- perform upgrades
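For instance, upgrading a node to the custom installer image could look something like this (see the Upgrading Talos guide; the node IP and image name are placeholders):

talosctl upgrade --nodes <node IP> --image <organization>/installer:latest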
We will step through common customizations in the remainder of this section.
9 - Managing PKI
Generating an Administrator Key Pair
In order to create a key pair, you will need the root CA.
Save the CA public key and CA private key as `ca.crt` and `ca.key`, respectively.
Now, run the following commands to generate a certificate:
talosctl gen key --name admin
talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin
Now, base64 encode `admin.crt` and `admin.key`:
cat admin.crt | base64
cat admin.key | base64
You can now set the `crt` and `key` fields in the `talosconfig` to the base64 encoded strings.
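For reference, a minimal sketch of the relevant part of a `talosconfig` (the context name and endpoint are placeholders):

context: admin@cluster
contexts:
  admin@cluster:
    endpoints:
      - <endpoint>
    ca: <base64 encoded ca.crt>
    crt: <base64 encoded admin.crt>
    key: <base64 encoded admin.key>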
Renewing an Expired Administrator Certificate
In order to renew the certificate, you will need the root CA and the admin private key. The base64 encoded key can be found in any one of the control plane nodes' configuration files; exactly where will depend on the specific version of the configuration file you are using.
Save the CA public key, CA private key, and admin private key as `ca.crt`, `ca.key`, and `admin.key` respectively.
Now, run the following commands to generate a certificate:
talosctl gen csr --key admin.key --ip 127.0.0.1
talosctl gen crt --ca ca --csr admin.csr --name admin
You should see `admin.crt` in your current directory.
Now, base64 encode `admin.crt`:
cat admin.crt | base64
You can now set the certificate in the `talosconfig` to the base64 encoded string.
10 - Resetting a Machine
From time to time, it may be beneficial to reset a Talos machine to its "original" state. Bear in mind that this is a destructive action for the given machine. Doing this means removing the machine from Kubernetes, Etcd (if applicable), and clearing any data on the machine that would normally persist across a reboot.
The API command for doing this is `talosctl reset`.
There are a couple of flags as part of this command:
Flags:
      --graceful   if true, attempt to cordon/drain node and leave etcd (if applicable) (default true)
      --reboot     if true, reboot the node after resetting instead of shutting down
The `graceful` flag is especially important when considering HA vs. non-HA Talos clusters.
If the machine is part of an HA cluster, a normal, graceful reset should work just fine right out of the box as long as the cluster is in a good state.
However, if this is a single node cluster being used for testing purposes, a graceful reset is not an option since Etcd cannot be "left" if there is only a single member.
In this case, reset should be used with `--graceful=false` to skip performing checks that would normally block the reset.
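For example, resetting a single-node test cluster and rebooting it afterwards might look like this (the node IP is a placeholder):

talosctl reset --graceful=false --reboot --nodes <node IP>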
11 - Upgrading Kubernetes
Video Walkthrough
To see a live demo of this writeup, see the video below:
Kubeconfig
In order to edit the control plane, we will need a working `kubectl` config.
If you don't already have one, you can get one by running:
talosctl --nodes <master node> kubeconfig
Automated Kubernetes Upgrade
To upgrade from Kubernetes v1.18.6 to v1.19.0 run:
$ talosctl --nodes <master node> upgrade-k8s --from 1.18.6 --to 1.19.0
updating pod-checkpointer grace period to "0m"
sleeping 5m0s to let the pod-checkpointer self-checkpoint be updated
temporarily taking "kube-apiserver" out of pod-checkpointer control
updating daemonset "kube-apiserver" to version "1.19.0"
updating daemonset "kube-controller-manager" to version "1.19.0"
updating daemonset "kube-scheduler" to version "1.19.0"
updating daemonset "kube-proxy" to version "1.19.0"
updating pod-checkpointer grace period to "5m0s"
Manual Kubernetes Upgrade
Kubernetes can be upgraded manually as well by following the steps outlined below.
They are equivalent to the steps performed by the `talosctl upgrade-k8s` command.
pod-checkpointer
Talos runs the `pod-checkpointer` component, which helps recover control plane components (specifically, the API server) if the control plane is not healthy.
However, the way checkpoints interact with the API server upgrade may make an upgrade take a lot longer due to a race condition on the API server listen port.
In order to speed up upgrades, first lower the `pod-checkpointer` grace period to zero (`kubectl -n kube-system edit daemonset pod-checkpointer`), change:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: pod-checkpointer
          command:
            ...
            - --checkpoint-grace-period=5m0s
to:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: pod-checkpointer
          command:
            ...
            - --checkpoint-grace-period=0s
Wait for 5 minutes to let `pod-checkpointer` update the self-checkpoint to the new grace period.
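To confirm that the DaemonSet rollout itself has completed before starting the 5 minute timer, you can watch it with:

kubectl -n kube-system rollout status daemonset pod-checkpointer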
API Server
In the API server's `DaemonSet`, change:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-apiserver
          image: ...
          command:
            - ./hyperkube
            - kube-apiserver
to:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-apiserver
          image: k8s.gcr.io/kube-apiserver:v1.19.0
          command:
            - /go-runner
            - /usr/local/bin/kube-apiserver
To edit the `DaemonSet`, run:
kubectl edit daemonsets -n kube-system kube-apiserver
Controller Manager
In the controller manager's `DaemonSet`, change:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-controller-manager
          image: ...
          command:
            - ./hyperkube
            - kube-controller-manager
to:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-controller-manager
          image: k8s.gcr.io/kube-controller-manager:v1.19.0
          command:
            - /go-runner
            - /usr/local/bin/kube-controller-manager
To edit the `DaemonSet`, run:
kubectl edit daemonsets -n kube-system kube-controller-manager
Scheduler
In the scheduler's `DaemonSet`, change:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-scheduler
          image: ...
          command:
            - ./hyperkube
            - kube-scheduler
to:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-scheduler
          image: k8s.gcr.io/kube-scheduler:v1.19.0
          command:
            - /go-runner
            - /usr/local/bin/kube-scheduler
To edit the `DaemonSet`, run:
kubectl edit daemonsets -n kube-system kube-scheduler
Restoring pod-checkpointer
Restore the grace period of 5 minutes (`kubectl -n kube-system edit daemonset pod-checkpointer`), change:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: pod-checkpointer
          command:
            ...
            - --checkpoint-grace-period=0s
to:
kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: pod-checkpointer
          command:
            ...
            - --checkpoint-grace-period=5m0s
Kubelet
The Talos team now maintains an image for the `kubelet` that should be used starting with Kubernetes 1.19.
The image for this release is `ghcr.io/talos-systems/kubelet:v1.19.3`.
To explicitly set the image, we can use the official documentation.
For example:
machine:
  ...
  kubelet:
    image: ghcr.io/talos-systems/kubelet:v1.19.3
12 - Upgrading Talos
Talos upgrades are effected by an API call.
The `talosctl` CLI utility will facilitate this, or you can use the automatic upgrade features provided by the Talos Controller Manager.
Video Walkthrough
To see a live demo of this writeup, see the video below:
`talosctl` Upgrade
To manually upgrade a Talos node, you will specify the node's IP address and the installer container image for the version of Talos to which you wish to upgrade.
For instance, if your Talos node has the IP address `10.20.30.40` and you want to install the official version `v0.7.0-beta.0`, you would enter a command such as:
$ talosctl upgrade --nodes 10.20.30.40 \
--image ghcr.io/talos-systems/installer:v0.7.0-beta.0
There is an option to this command: `--preserve`, which can be used to explicitly tell Talos to either keep intact its ephemeral data or not.
In most cases, it is correct to just let Talos perform its default action.
However, if you are running a single-node control-plane, you will want to make sure that `--preserve=true`.
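For a single-node control-plane, the command above would then look something like:

$ talosctl upgrade --nodes 10.20.30.40 \
    --image ghcr.io/talos-systems/installer:v0.7.0-beta.0 \
    --preserve=true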
Talos Controller Manager
The Talos Controller Manager can coordinate upgrades of your nodes automatically.
It ensures that a controllable number of nodes are being upgraded at any given time.
It also applies an upgrade flow which allows you to classify some machines as early adopters and others as getting only stable, tested versions.
To find out more about the controller manager and to get it installed and configured, take a look at the GitHub page. Please note that the controller manager is still in fairly early development. More advanced features, such as time slot scheduling, will be coming in the future.
Changelog and Upgrade Notes
In an effort to create more production ready clusters, Talos will now taint control plane nodes as unschedulable. This means that any application you might have deployed must tolerate this taint if you intend on running the application on control plane nodes.
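For example, such an application would need a toleration along these lines in its pod spec (assuming the standard `node-role.kubernetes.io/master` taint key; verify against your cluster):

tolerations:
  - key: node-role.kubernetes.io/master
    effect: NoSchedule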
Another feature you will notice is the automatic uncordoning of nodes that have been upgraded. Talos will now uncordon a node if the cordon was initiated by the upgrade process.
Talosctl
The `talosctl` CLI now requires an explicit set of nodes.
This can be configured with `talosctl config nodes` or set on the fly with `talosctl --nodes`.