This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

Kubernetes Guides

Management of a Kubernetes Cluster hosted by Talos Linux

1: Configuration

1.1: Ceph Storage cluster with Rook
1.2: Deploying Metrics Server
1.3: Device Plugins
1.4: iSCSI Storage with Synology CSI
1.5: KubePrism
1.6: Local Storage
1.7: Pod Security
1.8: Seccomp Profiles
1.9: Storage
1.10: User Namespaces

2: Network

2.1: Deploying Cilium CNI
2.2: Multus CNI

3: Upgrading Kubernetes

1 - Configuration

How to configure components of the Kubernetes cluster itself.

1.1 - Ceph Storage cluster with Rook

Guide on how to create a simple Ceph storage cluster with Rook for Kubernetes

Preparation

Talos Linux reserves an entire disk for the OS installation, so machines with multiple available disks are needed for a reliable Ceph cluster with Rook and Talos Linux. Rook requires that the block devices or partitions used by Ceph have no partitions or formatted filesystems before use. Rook also requires a minimum Kubernetes version of v1.16 and Helm v3.0 for installation of charts. It is highly recommended that the Rook Ceph overview is read and understood before deploying a Ceph cluster with Rook.

Installation

Creating a Ceph cluster with Rook requires two steps; first the Rook Operator needs to be installed which can be done with a Helm Chart. The example below installs the Rook Operator into the rook-ceph namespace, which is the default for a Ceph cluster with Rook.

$ helm repo add rook-release https://charts.rook.io/release
"rook-release" has been added to your repositories

$ helm install --create-namespace --namespace rook-ceph rook-ceph rook-release/rook-ceph
W0327 17:52:44.277830   54987 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0327 17:52:44.612243   54987 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
NAME: rook-ceph
LAST DEPLOYED: Sun Mar 27 17:52:42 2022
NAMESPACE: rook-ceph
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Rook Operator has been installed. Check its status by running:
  kubectl --namespace rook-ceph get pods -l "app=rook-ceph-operator"

Visit https://rook.io/docs/rook/latest for instructions on how to create and configure Rook clusters

Important Notes:
- You must customize the 'CephCluster' resource in the sample manifests for your cluster.
- Each CephCluster must be deployed to its own namespace, the samples use `rook-ceph` for the namespace.
- The sample manifests assume you also installed the rook-ceph operator in the `rook-ceph` namespace.
- The helm chart includes all the RBAC required to create a CephCluster CRD in the same namespace.
- Any disk devices you add to the cluster in the 'CephCluster' must be empty (no filesystem and no partitions).

Default PodSecurity configuration prevents execution of priviledged pods. Adding a label to the namespace will allow ceph to start.

kubectl label namespace rook-ceph pod-security.kubernetes.io/enforce=privileged

Once that is complete, the Ceph cluster can be installed with the official Helm Chart. The Chart can be installed with default values, which will attempt to use all nodes in the Kubernetes cluster, and all unused disks on each node for Ceph storage, and make available block storage, object storage, as well as a shared filesystem. Generally more specific node/device/cluster configuration is used, and the Rook documentation explains all the available options in detail. For this example the defaults will be adequate.

$ helm install --create-namespace --namespace rook-ceph rook-ceph-cluster --set operatorNamespace=rook-ceph rook-release/rook-ceph-cluster
NAME: rook-ceph-cluster
LAST DEPLOYED: Sun Mar 27 18:12:46 2022
NAMESPACE: rook-ceph
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Ceph Cluster has been installed. Check its status by running:
  kubectl --namespace rook-ceph get cephcluster

Visit https://rook.github.io/docs/rook/latest/ceph-cluster-crd.html for more information about the Ceph CRD.

Important Notes:
- You can only deploy a single cluster per namespace
- If you wish to delete this cluster and start fresh, you will also have to wipe the OSD disks using `sfdisk`

Now the Ceph cluster configuration has been created, the Rook operator needs time to install the Ceph cluster and bring all the components online. The progression of the Ceph cluster state can be followed with the following command.

$ watch kubectl --namespace rook-ceph get cephcluster rook-ceph
Every 2.0s: kubectl --namespace rook-ceph get cephcluster rook-ceph

NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE         MESSAGE                 HEALTH   EXTERNAL
rook-ceph   /var/lib/rook     3          57s   Progressing   Configuring Ceph Mons

Depending on the size of the Ceph cluster and the availability of resources the Ceph cluster should become available, and with it the storage classes that can be used with Kubernetes Physical Volumes.

$ kubectl --namespace rook-ceph get cephcluster rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
rook-ceph   /var/lib/rook     3          40m   Ready   Cluster created successfully   HEALTH_OK

$ kubectl  get storageclass
NAME                   PROVISIONER                     RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
ceph-block (default)   rook-ceph.rbd.csi.ceph.com      Delete          Immediate           true                   77m
ceph-bucket            rook-ceph.ceph.rook.io/bucket   Delete          Immediate           false                  77m
ceph-filesystem        rook-ceph.cephfs.csi.ceph.com   Delete          Immediate           true                   77m

Talos Linux Considerations

It is important to note that a Rook Ceph cluster saves cluster information directly onto the node (by default dataDirHostPath is set to /var/lib/rook).

By default, Rook configues Ceph to have 3 mon instances, in which case the data stored in dataDirHostPath can be regenerated from the other mon instances. So when performing maintenance on a Talos Linux node with a Rook Ceph cluster (e.g. upgrading the Talos Linux version), it is imperative that care be taken to maintain the health of the Ceph cluster. Before upgrading, you should always check the health status of the Ceph cluster to ensure that it is healthy.

$ kubectl --namespace rook-ceph get cephclusters.ceph.rook.io rook-ceph
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH      EXTERNAL
rook-ceph   /var/lib/rook     3          98m   Ready   Cluster created successfully   HEALTH_OK

If it is, you can begin the upgrade process for the Talos Linux node, during which time the Ceph cluster will become unhealthy as the node is reconfigured. Before performing any other action on the Talos Linux nodes, the Ceph cluster must return to a healthy status.

$ talosctl upgrade --nodes 172.20.15.5 --image ghcr.io/talos-systems/installer:v0.14.3
NODE          ACK                        STARTED
172.20.15.5   Upgrade request received   2022-03-27 20:29:55.292432887 +0200 CEST m=+10.050399758

$ kubectl --namespace rook-ceph get cephclusters.ceph.rook.io
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE         MESSAGE                   HEALTH        EXTERNAL
rook-ceph   /var/lib/rook     3          99m   Progressing   Configuring Ceph Mgr(s)   HEALTH_WARN

$ kubectl --namespace rook-ceph wait --timeout=1800s --for=jsonpath='{.status.ceph.health}=HEALTH_OK' cephclusters.ceph.rook.io rook-ceph
cephcluster.ceph.rook.io/rook-ceph condition met

The above steps need to be performed for each Talos Linux node undergoing maintenance, one at a time.

Cleaning Up

Rook Ceph Cluster Removal

Removing a Rook Ceph cluster requires a few steps, starting with signalling to Rook that the Ceph cluster is really being destroyed. Then all Persistent Volumes (and Claims) backed by the Ceph cluster must be deleted, followed by the Storage Classes and the Ceph storage types.

$ kubectl --namespace rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"cleanupPolicy":{"confirmation":"yes-really-destroy-data"}}}'
cephcluster.ceph.rook.io/rook-ceph patched

$ kubectl delete storageclasses ceph-block ceph-bucket ceph-filesystem
storageclass.storage.k8s.io "ceph-block" deleted
storageclass.storage.k8s.io "ceph-bucket" deleted
storageclass.storage.k8s.io "ceph-filesystem" deleted

$ kubectl --namespace rook-ceph delete cephblockpools ceph-blockpool
cephblockpool.ceph.rook.io "ceph-blockpool" deleted

$ kubectl --namespace rook-ceph delete cephobjectstore ceph-objectstore
cephobjectstore.ceph.rook.io "ceph-objectstore" deleted

$ kubectl --namespace rook-ceph delete cephfilesystem ceph-filesystem
cephfilesystem.ceph.rook.io "ceph-filesystem" deleted

Once that is complete, the Ceph cluster itself can be removed, along with the Rook Ceph cluster Helm chart installation.

$ kubectl --namespace rook-ceph delete cephcluster rook-ceph
cephcluster.ceph.rook.io "rook-ceph" deleted

$ helm --namespace rook-ceph uninstall rook-ceph-cluster
release "rook-ceph-cluster" uninstalled

If needed, the Rook Operator can also be removed along with all the Custom Resource Definitions that it created.

$ helm --namespace rook-ceph uninstall rook-ceph
W0328 12:41:14.998307  147203 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
These resources were kept due to the resource policy:
[CustomResourceDefinition] cephblockpools.ceph.rook.io
[CustomResourceDefinition] cephbucketnotifications.ceph.rook.io
[CustomResourceDefinition] cephbuckettopics.ceph.rook.io
[CustomResourceDefinition] cephclients.ceph.rook.io
[CustomResourceDefinition] cephclusters.ceph.rook.io
[CustomResourceDefinition] cephfilesystemmirrors.ceph.rook.io
[CustomResourceDefinition] cephfilesystems.ceph.rook.io
[CustomResourceDefinition] cephfilesystemsubvolumegroups.ceph.rook.io
[CustomResourceDefinition] cephnfses.ceph.rook.io
[CustomResourceDefinition] cephobjectrealms.ceph.rook.io
[CustomResourceDefinition] cephobjectstores.ceph.rook.io
[CustomResourceDefinition] cephobjectstoreusers.ceph.rook.io
[CustomResourceDefinition] cephobjectzonegroups.ceph.rook.io
[CustomResourceDefinition] cephobjectzones.ceph.rook.io
[CustomResourceDefinition] cephrbdmirrors.ceph.rook.io
[CustomResourceDefinition] objectbucketclaims.objectbucket.io
[CustomResourceDefinition] objectbuckets.objectbucket.io

release "rook-ceph" uninstalled

$ kubectl delete crds cephblockpools.ceph.rook.io cephbucketnotifications.ceph.rook.io cephbuckettopics.ceph.rook.io \
                      cephclients.ceph.rook.io cephclusters.ceph.rook.io cephfilesystemmirrors.ceph.rook.io \
                      cephfilesystems.ceph.rook.io cephfilesystemsubvolumegroups.ceph.rook.io \
                      cephnfses.ceph.rook.io cephobjectrealms.ceph.rook.io cephobjectstores.ceph.rook.io \
                      cephobjectstoreusers.ceph.rook.io cephobjectzonegroups.ceph.rook.io cephobjectzones.ceph.rook.io \
                      cephrbdmirrors.ceph.rook.io objectbucketclaims.objectbucket.io objectbuckets.objectbucket.io
customresourcedefinition.apiextensions.k8s.io "cephblockpools.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephbucketnotifications.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephbuckettopics.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephclients.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephclusters.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephfilesystemmirrors.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephfilesystems.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephfilesystemsubvolumegroups.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephnfses.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectrealms.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectstores.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectstoreusers.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectzonegroups.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephobjectzones.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "cephrbdmirrors.ceph.rook.io" deleted
customresourcedefinition.apiextensions.k8s.io "objectbucketclaims.objectbucket.io" deleted
customresourcedefinition.apiextensions.k8s.io "objectbuckets.objectbucket.io" deleted

Talos Linux Rook Metadata Removal

If the Rook Operator is cleanly removed following the above process, the node metadata and disks should be clean and ready to be re-used. In the case of an unclean cluster removal, there may be still a few instances of metadata stored on the system disk, as well as the partition information on the storage disks. First the node metadata needs to be removed, make sure to update the nodeName with the actual name of a storage node that needs cleaning, and path with the Rook configuration dataDirHostPath set when installing the chart. The following will need to be repeated for each node used in the Rook Ceph cluster.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-clean
spec:
  restartPolicy: Never
  nodeName: <storage-node-name>
  volumes:
  - name: rook-data-dir
    hostPath:
      path: <dataDirHostPath>
  containers:
  - name: disk-clean
    image: busybox
    securityContext:
      privileged: true
    volumeMounts:
    - name: rook-data-dir
      mountPath: /node/rook-data
    command: ["/bin/sh", "-c", "rm -rf /node/rook-data/*"]
EOF
pod/disk-clean created

$ kubectl wait --timeout=900s --for=jsonpath='{.status.phase}=Succeeded' pod disk-clean
pod/disk-clean condition met

$ kubectl delete pod disk-clean
pod "disk-clean" deleted

Lastly, the disks themselves need the partition and filesystem data wiped before they can be reused. Again, the following as to be repeated for each node and disk used in the Rook Ceph cluster, updating nodeName and of= in the command as needed.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-wipe
spec:
  restartPolicy: Never
  nodeName: <storage-node-name>
  containers:
  - name: disk-wipe
    image: busybox
    securityContext:
      privileged: true
    command: ["/bin/sh", "-c", "dd if=/dev/zero bs=1M count=100 oflag=direct of=<device>"]
EOF
pod/disk-wipe created

$ kubectl wait --timeout=900s --for=jsonpath='{.status.phase}=Succeeded' pod disk-wipe
pod/disk-wipe condition met

$ kubectl delete pod disk-wipe
pod "disk-wipe" deleted

1.2 - Deploying Metrics Server

In this guide you will learn how to set up metrics-server.

Metrics Server enables use of the Horizontal Pod Autoscaler and Vertical Pod Autoscaler. It does this by gathering metrics data from the kubelets in a cluster. By default, the certificates in use by the kubelets will not be recognized by metrics-server. This can be solved by either configuring metrics-server to do no validation of the TLS certificates, or by modifying the kubelet configuration to rotate its certificates and use ones that will be recognized by metrics-server.

Node Configuration

To enable kubelet certificate rotation, all nodes should have the following Machine Config snippet:

machine:
  kubelet:
    extraArgs:
      rotate-server-certificates: true

Install During Bootstrap

We will want to ensure that new certificates for the kubelets are approved automatically. This can easily be done with the Kubelet Serving Certificate Approver, which will automatically approve the Certificate Signing Requests generated by the kubelets.

We can have Kubelet Serving Certificate Approver and metrics-server installed on the cluster automatically during bootstrap by adding the following snippet to the Cluster Config of the node that will be handling the bootstrap process:

cluster:
  extraManifests:
    - https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
    - https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Install After Bootstrap

If you choose not to use extraManifests to install Kubelet Serving Certificate Approver and metrics-server during bootstrap, you can install them once the cluster is online using kubectl:

kubectl apply -f https://raw.githubusercontent.com/alex1989hu/kubelet-serving-cert-approver/main/deploy/standalone-install.yaml
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

1.3 - Device Plugins

In this guide you will learn how to expose host devices to the Kubernetes pods.

Kubernetes Device Plugins can be used to expose host devices to the Kubernetes pods. This guide will show you how to deploy a device plugin to your Talos cluster. In this guide, we will use Kubernetes Generic Device Plugin, but there are other implementations available.

Deploying the Device Plugin

The Kubernetes Generic Device Plugin is a DaemonSet that runs on each node in the cluster, exposing the devices to the pods. The device plugin is configured with a list of devices to expose, e.g. --device='{"name": "video", "groups": [{"paths": [{"path": "/dev/video0"}]}]}.

In this guide, we will demonstrate how to deploy the device plugin with a configuration that exposes the /dev/net/tun device. This device is commonly used for user-space Wireguard, including Tailscale.

# generic-device-plugin.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: generic-device-plugin
  namespace: kube-system
  labels:
    app.kubernetes.io/name: generic-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: generic-device-plugin
  template:
    metadata:
      labels:
        app.kubernetes.io/name: generic-device-plugin
    spec:
      priorityClassName: system-node-critical
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - operator: "Exists"
        effect: "NoSchedule"
      containers:
      - image: squat/generic-device-plugin
        args:
        - --device
        - |
          name: tun
          groups:
            - count: 1000
              paths:
                - path: /dev/net/tun          
        name: generic-device-plugin
        resources:
          requests:
            cpu: 50m
            memory: 10Mi
          limits:
            cpu: 50m
            memory: 20Mi
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
  updateStrategy:
    type: RollingUpdate

Apply the manifest to your cluster:

kubectl apply -f generic-device-plugin.yaml

Once the device plugin is deployed, you can verify that the nodes have a new resource: squat.ai/tun (the tun name comes from the name of the group in the device plugin configuration).:

$ kubectl describe node worker-1
...
Allocated resources:
  Resource           Requests     Limits
  --------           --------     ------
  ...
  squat.ai/tun       0            0

Deploying a Pod with the Device

Now that the device plugin is deployed, you can deploy a pod that requests the device. The request for the device is specified as a resource in the pod spec.

requests:
  limits:
    squat.ai/tun: "1"

Here is an example non-privileged pod spec that requests the /dev/net/tun device:

# tun-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: tun-test
spec:
  containers:
  - image: alpine
    name: test
    command:
      - sleep
      - inf
    resources:
      limits:
        squat.ai/tun: "1"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
        add:
          - NET_ADMIN
  dnsPolicy: ClusterFirst
  restartPolicy: Always

When running the pod, you should see the /dev/net/tun device available:

$ ls -l /dev/net/tun
crw-rw-rw-    1 root     root       10, 200 Sep 17 10:30 /dev/net/tun

1.4 - iSCSI Storage with Synology CSI

Automatically provision iSCSI volumes on a Synology NAS with the synology-csi driver.

Background

Synology is a company that specializes in Network Attached Storage (NAS) devices. They provide a number of features within a simple web OS, including an LDAP server, Docker support, and (perhaps most relevant to this guide) function as an iSCSI host. The focus of this guide is to allow a Kubernetes cluster running on Talos to provision Kubernetes storage (both dynamic or static) on a Synology NAS using a direct integration, rather than relying on an intermediary layer like Rook/Ceph or Maystor.

This guide assumes a very basic familiarity with iSCSI terminology (LUN, iSCSI target, etc.).

Prerequisites

Synology NAS running DSM 7.0 or above
Provisioned Talos cluster running Kubernetes v1.20 or above with siderolabs/iscsi-tools extension installed
(Optional) Both Volume Snapshot CRDs and the common snapshot controller must be installed in your Kubernetes cluster if you want to use the Snapshot feature

Setting up the Synology user account

The synology-csi controller interacts with your NAS in two different ways: via the API and via the iSCSI protocol. Actions such as creating a new iSCSI target or deleting an old one are accomplished via the Synology API, and require administrator access. On the other hand, mounting the disk to a pod and reading from / writing to it will utilize iSCSI. Because you can only authenticate with one account per DSM configured, that account needs to have admin privileges. In order to minimize access in the case of these credentials being compromised, you should configure the account with the lease possible amount of access – explicitly specify “No Access” on all volumes when configuring the user permissions.

Setting up the Synology CSI

Note: this guide is paraphrased from the Synology CSI readme. Please consult the readme for more in-depth instructions and explanations.

Clone the git repository.

git clone https://github.com/zebernst/synology-csi-talos.git

While Synology provides some automated scripts to deploy the CSI driver, they can be finicky especially when making changes to the source code. We will be configuring and deploying things manually in this guide.

The relevant files we will be touching are in the following locations:

.
├── Dockerfile
├── Makefile
├── config
│   └── client-info-template.yml
└── deploy
    └── kubernetes
        └── v1.20
            ├── controller.yml
            ├── csi-driver.yml
            ├── namespace.yml
            ├── node.yml
            ├── snapshotter
            │   ├── snapshotter.yaml
            │   └── volume-snapshot-class.yml
            └── storage-class.yml

Configure connection info

Use config/client-info-template.yml as an example to configure the connection information for DSM. You can specify one or more storage systems on which the CSI volumes will be created. See below for an example:

---
clients:
- host: 192.168.1.1   # ipv4 address or domain of the DSM
  port: 5000          # port for connecting to the DSM
  https: false        # set this true to use https. you need to specify the port to DSM HTTPS port as well
  username: username  # username
  password: password  # password

Create a Kubernetes secret using the client information config file.

kubectl create secret -n synology-csi generic client-info-secret --from-file=config/client-info.yml

Note that if you rename the secret to something other than client-info-secret, make sure you update the corresponding references in the deployment manifests as well.

Build the Talos-compatible image

Modify the Makefile so that the image is built and tagged under your GitHub Container Registry username:

REGISTRY_NAME=ghcr.io/<username>

When you run make docker-build or make docker-build-multiarch, it will push the resulting image to ghcr.io/<username>/synology-csi:v1.1.0. Ensure that you find and change any reference to synology/synology-csi:v1.1.0 to point to your newly-pushed image within the deployment manifests.

Configure the CSI driver

By default, the deployment manifests include one storage class and one volume snapshot class. See below for examples:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  name: syno-storage
provisioner: csi.san.synology.com
parameters:
  fsType: 'ext4'
  dsm: '192.168.1.1'
  location: '/volume1'
reclaimPolicy: Retain
allowVolumeExpansion: true
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: syno-snapshot
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
driver: csi.san.synology.com
deletionPolicy: Delete
parameters:
  description: 'Kubernetes CSI'

It can be useful to configure multiple different StorageClasses. For example, a popular strategy is to create two nearly identical StorageClasses, with one configured with reclaimPolicy: Retain and the other with reclaimPolicy: Delete. Alternately, a workload may require a specific filesystem, such as ext4. If a Synology NAS is going to be the most common way to configure storage on your cluster, it can be convenient to add the storageclass.kubernetes.io/is-default-class: "true" annotation to one of your StorageClasses.

The following table details the configurable parameters for the Synology StorageClass.

Name	Type	Description	Default	Supported protocols
dsm	string	The IPv4 address of your DSM, which must be included in the `client-info.yml` for the CSI driver to log in to DSM	-	iSCSI, SMB
location	string	The location (/volume1, /volume2, …) on DSM where the LUN for PersistentVolume will be created	-	iSCSI, SMB
fsType	string	The formatting file system of the PersistentVolumes when you mount them on the pods. This parameter only works with iSCSI. For SMB, the fsType is always ‘cifs‘.	`ext4`	iSCSI
protocol	string	The backing storage protocol. Enter ‘iscsi’ to create LUNs or ‘smb‘ to create shared folders on DSM.	`iscsi`	iSCSI, SMB
csi.storage.k8s.io/node-stage-secret-name	string	The name of node-stage-secret. Required if DSM shared folder is accessed via SMB.	-	SMB
csi.storage.k8s.io/node-stage-secret-namespace	string	The namespace of node-stage-secret. Required if DSM shared folder is accessed via SMB.	-	SMB

The VolumeSnapshotClass can be similarly configured with the following parameters:

Name	Type	Description	Default	Supported protocols
description	string	The description of the snapshot on DSM	-	iSCSI
is_locked	string	Whether you want to lock the snapshot on DSM	`false`	iSCSI, SMB

Apply YAML manifests

Once you have created the desired StorageClass(es) and VolumeSnapshotClass(es), the final step is to apply the Kubernetes manifests against the cluster. The easiest way to apply them all at once is to create a kustomization.yaml file in the same directory as the manifests and use Kustomize to apply:

kubectl apply -k path/to/manifest/directory

Alternately, you can apply each manifest one-by-one:

kubectl apply -f <file>

Run performance tests

In order to test the provisioning, mounting, and performance of using a Synology NAS as Kubernetes persistent storage, use the following command:

kubectl apply -f speedtest.yaml

Content of speedtest.yaml (source)

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-claim
spec:
#  storageClassName: syno-storage
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5G
---
apiVersion: batch/v1
kind: Job
metadata:
  name: read
spec:
  template:
    metadata:
      name: read
      labels:
        app: speedtest
        job: read
    spec:
      containers:
      - name: read
        image: ubuntu:xenial
        command: ["dd","if=/mnt/pv/test.img","of=/dev/null","bs=8k"]
        volumeMounts:
        - mountPath: "/mnt/pv"
          name: test-volume
      volumes:
      - name: test-volume
        persistentVolumeClaim:
          claimName: test-claim
      restartPolicy: Never
---
apiVersion: batch/v1
kind: Job
metadata:
  name: write
spec:
  template:
    metadata:
      name: write
      labels:
        app: speedtest
        job: write
    spec:
      containers:
      - name: write
        image: ubuntu:xenial
        command: ["dd","if=/dev/zero","of=/mnt/pv/test.img","bs=1G","count=1","oflag=dsync"]
        volumeMounts:
        - mountPath: "/mnt/pv"
          name: test-volume
      volumes:
      - name: test-volume
        persistentVolumeClaim:
          claimName: test-claim
      restartPolicy: Never

If these two jobs complete successfully, use the following commands to get the results of the speed tests:

# Pod logs for read test:
kubectl logs -l app=speedtest,job=read

# Pod logs for write test:
kubectl logs -l app=speedtest,job=write

When you’re satisfied with the results of the test, delete the artifacts created from the speedtest:

kubectl delete -f speedtest.yaml

1.5 - KubePrism

Enabling in-cluster highly-available controlplane endpoint.

Kubernetes pods running in CNI mode can use the kubernetes.default.svc service endpoint to access the Kubernetes API server, while pods running in host networking mode can only use the external cluster endpoint to access the Kubernetes API server.

Kubernetes controlplane components run in host networking mode, and it is critical for them to be able to access the Kubernetes API server, same as CNI components (when CNI requires access to Kubernetes API).

The external cluster endpoint might be unavailable due to misconfiguration or network issues, or it might have higher latency than the internal endpoint. A failure to access the Kubernetes API server might cause a series of issues in the cluster: pods are not scheduled, service IPs stop working, etc.

KubePrism feature solves this problem by enabling in-cluster highly-available controlplane endpoint on every node in the cluster.

Video Walkthrough

To see a live demo of this writeup, see the video below:

Enabling KubePrism

As of Talos 1.6, KubePrism is enabled by default with port 7445.

Note: the port specified should be available on every node in the cluster.

How it works

Talos spins up a TCP loadbalancer on every machine on the localhost on the specified port which automatically picks up one of the endpoints:

the external cluster endpoint as specified in the machine configuration
for controlplane machines: https://localhost:<api-server-local-port> (http://localhost:6443 in the default configuration)
https://<controlplane-address>:<api-server-port> for every controlplane machine (based on the information from Cluster Discovery)

KubePrism automatically filters out unhealthy (or unreachable) endpoints, and prefers lower-latency endpoints over higher-latency endpoints.

Talos automatically reconfigures kubelet, kube-scheduler and kube-controller-manager to use the KubePrism endpoint. The kube-proxy manifest is also reconfigured to use the KubePrism endpoint by default, but when enabling KubePrism for a running cluster the manifest should be updated with talosctl upgrade-k8s command.

When using CNI components that require access to the Kubernetes API server, the KubePrism endpoint should be passed to the CNI configuration (e.g. Cilium, Calico CNIs).

Notes

As the list of endpoints for KubePrism includes the external cluster endpoint, KubePrism in the worst case scenario will behave the same as the external cluster endpoint. For controlplane nodes, the KubePrism should pick up the localhost endpoint of the kube-apiserver, minimizing the latency. Worker nodes might use direct address of the controlplane endpoint if the latency is lower than the latency of the external cluster endpoint.

KubePrism listen endpoint is bound to localhost address, so it can’t be used outside the cluster.

1.6 - Local Storage

Using local storage for Kubernetes workloads.

Using local storage for Kubernetes workloads implies that the pod will be bound to the node where the local storage is available. Local storage is not replicated, so in case of a machine failure contents of the local storage will be lost.

`hostPath` mounts

The simplest way to use local storage is to use hostPath mounts. When using hostPath mounts, make sure the root directory of the mount is mounted into the kubelet container:

machine:
  kubelet:
    extraMounts:
      - destination: /var/mnt
        type: bind
        source: /var/mnt
        options:
          - bind
          - rshared
          - rw

Both EPHEMERAL partition and user disks can be used for hostPath mounts.

Local Path Provisioner

Local Path Provisioner can be used to dynamically provision local storage. Make sure to update its configuration to use a path under /var, e.g. /var/local-path-provisioner as the root path for the local storage. (In Talos Linux default local path provisioner path /opt/local-path-provisioner is read-only).

For example, Local Path Provisioner can be installed using kustomize with the following configuration:

# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- github.com/rancher/local-path-provisioner/deploy?ref=v0.0.26
patches:
- patch: |-
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: local-path-config
      namespace: local-path-storage
    data:
      config.json: |-
        {
                "nodePathMap":[
                {
                        "node":"DEFAULT_PATH_FOR_NON_LISTED_NODES",
                        "paths":["/var/local-path-provisioner"]
                }
                ]
        }    
- patch: |-
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-path
      annotations:
        storageclass.kubernetes.io/is-default-class: "true"    
- patch: |-
    apiVersion: v1
    kind: Namespace
    metadata:
      name: local-path-storage
      labels:
        pod-security.kubernetes.io/enforce: privileged

Put kustomization.yaml into a new directory, and run kustomize build | kubectl apply -f - to install Local Path Provisioner to a Talos Linux cluster. There are three patches applied:

change default /opt/local-path-provisioner path to /var/local-path-provisioner
make local-path storage class the default storage class (optional)
label the local-path-storage namespace as privileged to allow privileged pods to be scheduled there

As for the hostPath mounts (see above), this will require the kubelet to bind mount the node’s folder you chose (eg: /var/local-path-provisioner). Otherwise, you’ll have erratic behavior, especially when using the subPath statement in a volumeMount, which may lead to data loss and/or data never freed after PV deletion.

machine:
  kubelet:
    extraMounts:
      - destination: /var/local-path-provisioner
        type: bind
        source: /var/local-path-provisioner
        options:
          - bind
          - rshared
          - rw

To test the Local Path Provisioner, you can refer to the Usage section of the official guide.

You can check that directories for PVCs are created on the node’s filesystem with the talosctl ls /var/local-path-provisioner command.

1.7 - Pod Security

Enabling Pod Security Admission plugin to configure Pod Security Standards.

Kubernetes deprecated Pod Security Policy as of v1.21, and it was removed in v1.25.

Pod Security Policy was replaced with Pod Security Admission, which is enabled by default starting with Kubernetes v1.23.

Talos Linux by default enables and configures Pod Security Admission plugin to enforce Pod Security Standards with the baseline profile as the default enforced with the exception of kube-system namespace which enforces privileged profile.

Some applications (e.g. Prometheus node exporter or storage solutions) require more relaxed Pod Security Standards, which can be configured by either updating the Pod Security Admission plugin configuration, or by using the pod-security.kubernetes.io/enforce label on the namespace level:

kubectl label namespace NAMESPACE-NAME pod-security.kubernetes.io/enforce=privileged

Configuration

Talos provides default Pod Security Admission in the machine configuration:

apiVersion: pod-security.admission.config.k8s.io/v1alpha1
kind: PodSecurityConfiguration
defaults:
    enforce: "baseline"
    enforce-version: "latest"
    audit: "restricted"
    audit-version: "latest"
    warn: "restricted"
    warn-version: "latest"
exemptions:
    usernames: []
    runtimeClasses: []
    namespaces: [kube-system]

This is a cluster-wide configuration for the Pod Security Admission plugin:

by default baseline Pod Security Standard profile is enforced
more strict restricted profile is not enforced, but API server warns about found issues

This default policy can be modified by updating the generated machine configuration before the cluster is created or on the fly by using the talosctl CLI utility.

Verify current admission plugin configuration with:

$ talosctl get admissioncontrolconfigs.kubernetes.talos.dev admission-control -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: AdmissionControlConfigs.kubernetes.talos.dev
    id: admission-control
    version: 1
    owner: config.K8sControlPlaneController
    phase: running
    created: 2022-02-22T20:28:21Z
    updated: 2022-02-22T20:28:21Z
spec:
    config:
        - name: PodSecurity
          configuration:
            apiVersion: pod-security.admission.config.k8s.io/v1alpha1
            defaults:
                audit: restricted
                audit-version: latest
                enforce: baseline
                enforce-version: latest
                warn: restricted
                warn-version: latest
            exemptions:
                namespaces:
                    - kube-system
                runtimeClasses: []
                usernames: []
            kind: PodSecurityConfiguration

Usage

Create a deployment that satisfies the baseline policy but gives warnings on restricted policy:

$ kubectl create deployment nginx --image=nginx
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "nginx" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "nginx" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "nginx" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "nginx" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/nginx created
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
nginx-85b98978db-j68l8   1/1     Running   0          2m3s

Create a daemonset which fails to meet requirements of the baseline policy:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: debug-container
  name: debug-container
  namespace: default
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: debug-container
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: debug-container
    spec:
      containers:
      - args:
        - "360000"
        command:
        - /bin/sleep
        image: ubuntu:latest
        imagePullPolicy: IfNotPresent
        name: debug-container
        resources: {}
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirstWithHostNet
      hostIPC: true
      hostPID: true
      hostNetwork: true
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

$ kubectl apply -f debug.yaml
Warning: would violate PodSecurity "restricted:latest": host namespaces (hostNetwork=true, hostPID=true, hostIPC=true), privileged (container "debug-container" must not set securityContext.privileged=true), allowPrivilegeEscalation != false (container "debug-container" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "debug-container" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "debug-container" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "debug-container" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/debug-container created

Daemonset debug-container gets created, but no pods are scheduled:

$ kubectl get ds
NAME              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
debug-container   0         0         0       0            0           <none>          34s

Pod Security Admission plugin errors are in the daemonset events:

$ kubectl describe ds debug-container
...
  Warning  FailedCreate  92s                daemonset-controller  Error creating: pods "debug-container-kwzdj" is forbidden: violates PodSecurity "baseline:latest": host namespaces (hostNetwork=true, hostPID=true, hostIPC=true), privileged (container "debug-container" must not set securityContext.privileged=true)

Pod Security Admission configuration can also be overridden on a namespace level:

$ kubectl label ns default pod-security.kubernetes.io/enforce=privileged
namespace/default labeled
$ kubectl get ds
NAME              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
debug-container   2         2         0       2            0           <none>          4s

As enforce policy was updated to the privileged for the default namespace, debug-container is now successfully running.

1.8 - Seccomp Profiles

Using custom Seccomp Profiles with Kubernetes workloads.

Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel.

Refer the Kubernetes Seccomp Guide for more details.

In this guide we are going to configure a custom Seccomp Profile that logs all syscalls made by the workload.

Preparing the nodes

Create a machine config path with the contents below and save as patch.yaml

machine:
  seccompProfiles:
    - name: audit.json
      value:
        defaultAction: SCMP_ACT_LOG

Apply the machine config to all the nodes using talosctl:

talosctl -e <endpoint ip/hostname> -n <node ip/hostname> patch mc -p @patch.yaml

This would create a seccomp profile name audit.json on the node at /var/lib/kubelet/seccomp/profiles.

The profiles can be used by Kubernetes pods by specfying the pod securityContext as below:

spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/audit.json

Note that the localhostProfile uses the name of the profile created under profiles directory. So make sure to use path as profiles/<profile-name.json>

This can be verfied by running the below commands:

talosctl -e <endpoint ip/hostname> -n <node ip/hostname> get seccompprofiles

An output similar to below can be observed:

NODE       NAMESPACE   TYPE             ID           VERSION
10.5.0.3   cri         SeccompProfile   audit.json   1

The content of the seccomp profile can be viewed by running the below command:

talosctl -e <endpoint ip/hostname> -n <node ip/hostname> read /var/lib/kubelet/seccomp/profiles/audit.json

An output similar to below can be observed:

{"defaultAction":"SCMP_ACT_LOG"}

Create a Kubernetes workload that uses the custom Seccomp Profile

Here we’ll be using an example workload from the Kubernetes documentation.

First open up a second terminal and run the following talosctl command so that we can view the Syscalls being logged in realtime:

talosctl -e <endpoint ip/hostname> -n <node ip/hostname> dmesg --follow --tail

Now deploy the example workload from the Kubernetes documentation:

kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/audit-pod.yaml

Once the pod starts running the terminal running talosctl dmesg command from above should log similar to below:

10.5.0.3: kern:    info: [2022-07-28T11:49:42.489473063Z]: cni0: port 1(veth32488a86) entered blocking state
10.5.0.3: kern:    info: [2022-07-28T11:49:42.490852063Z]: cni0: port 1(veth32488a86) entered disabled state
10.5.0.3: kern:    info: [2022-07-28T11:49:42.492470063Z]: device veth32488a86 entered promiscuous mode
10.5.0.3: kern:    info: [2022-07-28T11:49:42.503105063Z]: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
10.5.0.3: kern:    info: [2022-07-28T11:49:42.503944063Z]: IPv6: ADDRCONF(NETDEV_CHANGE): veth32488a86: link becomes ready
10.5.0.3: kern:    info: [2022-07-28T11:49:42.504764063Z]: cni0: port 1(veth32488a86) entered blocking state
10.5.0.3: kern:    info: [2022-07-28T11:49:42.505423063Z]: cni0: port 1(veth32488a86) entered forwarding state
10.5.0.3: kern: warning: [2022-07-28T11:49:44.873616063Z]: kauditd_printk_skb: 14 callbacks suppressed
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.873619063Z]: audit: type=1326 audit(1659008985.445:25): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=3 compat=0 ip=0x55ec0657bd3b code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.876609063Z]: audit: type=1326 audit(1659008985.445:26): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=3 compat=0 ip=0x55ec0657bd3b code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.878789063Z]: audit: type=1326 audit(1659008985.449:27): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=257 compat=0 ip=0x55ec0657bdaa code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.886693063Z]: audit: type=1326 audit(1659008985.461:28): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=202 compat=0 ip=0x55ec06532b43 code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.888764063Z]: audit: type=1326 audit(1659008985.461:29): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=202 compat=0 ip=0x55ec06532b43 code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.891009063Z]: audit: type=1326 audit(1659008985.461:30): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=1 compat=0 ip=0x55ec0657bd3b code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.893162063Z]: audit: type=1326 audit(1659008985.461:31): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=3 compat=0 ip=0x55ec0657bd3b code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.895365063Z]: audit: type=1326 audit(1659008985.461:32): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=39 compat=0 ip=0x55ec066eb68b code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.898306063Z]: audit: type=1326 audit(1659008985.461:33): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="runc:[2:INIT]" exe="/" sig=0 arch=c000003e syscall=59 compat=0 ip=0x55ec0657be16 code=0x7ffc0000
10.5.0.3: kern:  notice: [2022-07-28T11:49:44.901518063Z]: audit: type=1326 audit(1659008985.473:34): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2784 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=158 compat=0 ip=0x455f35 code=0x7ffc0000

Cleanup

You can clean up the test resources by running the following command:

kubectl delete pod audit-pod

1.9 - Storage

Setting up storage for a Kubernetes cluster

In Kubernetes, using storage in the right way is well-facilitated by the API. However, unless you are running in a major public cloud, that API may not be hooked up to anything. This frequently sends users down a rabbit hole of researching all the various options for storage backends for their platform, for Kubernetes, and for their workloads. There are a lot of options out there, and it can be fairly bewildering.

For Talos, we try to limit the options somewhat to make the decision-making easier.

Public Cloud

If you are running on a major public cloud, use their block storage. It is easy and automatic.

Storage Clusters

Sidero Labs recommends having separate disks (apart from the Talos install disk) to be used for storage.

Redundancy, scaling capabilities, reliability, speed, maintenance load, and ease of use are all factors you must consider when managing your own storage.

Running a storage cluster can be a very good choice when managing your own storage, and there are two projects we recommend, depending on your situation.

If you need vast amounts of storage composed of more than a dozen or so disks, we recommend you use Rook to manage Ceph. Also, if you need both mount-once and mount-many capabilities, Ceph is your answer. Ceph also bundles in an S3-compatible object store. The down side of Ceph is that there are a lot of moving parts.

Please note that most people should never use mount-many semantics. NFS is pervasive because it is old and easy, not because it is a good idea. While it may seem like a convenience at first, there are all manner of locking, performance, change control, and reliability concerns inherent in any mount-many situation, so we strongly recommend you avoid this method.

If your storage needs are small enough to not need Ceph, use Mayastor.

Rook/Ceph

Ceph is the grandfather of open source storage clusters. It is big, has a lot of pieces, and will do just about anything. It scales better than almost any other system out there, open source or proprietary, being able to easily add and remove storage over time with no downtime, safely and easily. It comes bundled with RadosGW, an S3-compatible object store; CephFS, a NFS-like clustered filesystem; and RBD, a block storage system.

With the help of Rook, the vast majority of the complexity of Ceph is hidden away by a very robust operator, allowing you to control almost everything about your Ceph cluster from fairly simple Kubernetes CRDs.

So if Ceph is so great, why not use it for everything?

Ceph can be rather slow for small clusters. It relies heavily on CPUs and massive parallelisation to provide good cluster performance, so if you don’t have much of those dedicated to Ceph, it is not going to be well-optimised for you. Also, if your cluster is small, just running Ceph may eat up a significant amount of the resources you have available.

Troubleshooting Ceph can be difficult if you do not understand its architecture. There are lots of acronyms and the documentation assumes a fair level of knowledge. There are very good tools for inspection and debugging, but this is still frequently seen as a concern.

OpenEBS Mayastor replicated storage

Mayastor is an OpenEBS project built in Rust utilising the modern NVMEoF system. It is fast and lean but still cluster-oriented and cloud native.

Video Walkthrough

To see a live demo of this section, see the video below:

Prep Nodes

Either during initial cluster creation or on running worker nodes, several machine config values should be edited. (This information is gathered from the OpenEBS Replicated PV Mayastor documentation.)

This can be done with talosctl patch machineconfig or via config patches during talosctl gen config.

Some examples are shown below: modify as needed.

First create a config patch file named mayastor-patch.yaml with the following contents:

machine:
  sysctls:
    vm.nr_hugepages: "1024"
  nodeLabels:
    openebs.io/engine: "mayastor"
  kubelet:
    extraMounts:
      - destination: /var/local
        type: bind
        source: /var/local
        options:
          - bind
          - rshared
          - rw

Create another config patch file named mayastor-patch-cp.yaml with the following contents:

cluster:
  apiServer:
    admissionControl:
      - name: PodSecurity
        configuration:
          apiVersion: pod-security.admission.config.k8s.io/v1beta1
          kind: PodSecurityConfiguration
          exemptions:
            namespaces:
              - openebs

Using gen config

talosctl gen config my-cluster https://mycluster.local:6443 --config-patch-control-plane=@mayastor-patch-cp.yaml --config-patch-worker @mayastor-patch.yaml

Patching an existing node

talosctl patch machineconfig -n <node ip> --patch @mayastor-patch.yaml

Note: If you are adding/updating the vm.nr_hugepages on a node which already had the openebs.io/engine=mayastor label set, you’d need to restart kubelet so that it picks up the new value, by issuing the following command

talosctl -n <node ip> service kubelet restart

Deploy Mayastor

We need to disable the init container that checks for the nvme_tcp module, since Talos has that module built-in.

Create a helm values file mayastor-values.yaml with the following contents:

mayastor:
  csi:
    node:
      initContainers:
        enabled: false

If you do not need to use the LVM and ZFS engines they can be disabled in the values file:

engines:
  local:
    lvm:
      enabled: false
    zfs:
      enabled: false

Continue setting up Mayastor using the official documentation, passing the values file.

Follow the Post-Installation from official documentation to either create use Local Storage or Replicated Storage.

Piraeus / LINSTOR

Install Piraeus Operator V2

There is already a how-to for Talos: Link

Create first storage pool and PVC

Before proceeding, install linstor plugin for kubectl: https://github.com/piraeusdatastore/kubectl-linstor

Or use krew: kubectl krew install linstor

# Create device pool on a blank (no partition table!) disk on node01
kubectl linstor physical-storage create-device-pool --pool-name nvme_lvm_pool LVM node01 /dev/nvme0n1 --storage-pool nvme_pool

piraeus-sc.yml

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: simple-nvme
parameters:
  csi.storage.k8s.io/fstype: xfs
  linstor.csi.linbit.com/autoPlace: "3"
  linstor.csi.linbit.com/storagePool: nvme_pool
provisioner: linstor.csi.linbit.com
volumeBindingMode: WaitForFirstConsumer

# Create storage class
kubectl apply -f piraeus-sc.yml

NFS

NFS is an old pack animal long past its prime. NFS is slow, has all kinds of bottlenecks involving contention, distributed locking, single points of service, and more. However, it is supported by a wide variety of systems. You don’t want to use it unless you have to, but unfortunately, that “have to” is too frequent.

The NFS client is part of the kubelet image maintained by the Talos team. This means that the version installed in your running kubelet is the version of NFS supported by Talos. You can reduce some of the contention problems by parceling Persistent Volumes from separate underlying directories.

Object storage

Ceph comes with an S3-compatible object store, but there are other options, as well. These can often be built on top of other storage backends. For instance, you may have your block storage running with Mayastor but assign a Pod a large Persistent Volume to serve your object store.

One of the most popular open source add-on object stores is MinIO.

Others (iSCSI)

The most common remaining systems involve iSCSI in one form or another. These include the original OpenEBS, Rancher’s Longhorn, and many proprietary systems. iSCSI in Linux is facilitated by open-iscsi. This system was designed long before containers caught on, and it is not well suited to the task, especially when coupled with a read-only host operating system.

iSCSI support in Talos is now supported via the iscsi-tools system extension installed. The extension enables compatibility with OpenEBS Jiva - refer to the local storage installation guide for more information.

1.10 - User Namespaces

Guide on how to configure Talos Cluster to support User Namespaces

User Namespaces are a feature of the Linux kernel that allows unprivileged users to have their own range of UIDs and GIDs, without needing to be root.

Refer to the official documentation for more information on Usernamespaces.

Enabling Usernamespaces

To enable User Namespaces in Talos, you need to add the following configuration to Talos machine configuration:

---
cluster:
  apiServer:
    extraArgs:
      feature-gates: UserNamespacesSupport=true,UserNamespacesPodSecurityStandards=true
machine:
  sysctls:
    user.max_user_namespaces: "11255"
  kubelet:
    extraConfig:
      featureGates:
        UserNamespacesSupport: true
        UserNamespacesPodSecurityStandards: true

After applying the configuration, refer to the official documentation to configure workloads to use User Namespaces.

2 - Network

Managing the Kubernetes cluster networking

2.1 - Deploying Cilium CNI

In this guide you will learn how to set up Cilium CNI on Talos.

Cilium can be installed either via the cilium cli or using helm.

This documentation will outline installing Cilium CNI v1.14.0 on Talos in six different ways. Adhering to Talos principles we’ll deploy Cilium with IPAM mode set to Kubernetes, and using the cgroupv2 and bpffs mount that talos already provides. As Talos does not allow loading kernel modules by Kubernetes workloads, SYS_MODULE capability needs to be dropped from the Cilium default set of values, this override can be seen in the helm/cilium cli install commands. Each method can either install Cilium using kube proxy (default) or without: Kubernetes Without kube-proxy

In this guide we assume that KubePrism is enabled and configured to use the port 7445.

Machine config preparation

When generating the machine config for a node set the CNI to none. For example using a config patch:

Create a patch.yaml file with the following contents:

cluster:
  network:
    cni:
      name: none

talosctl gen config \
    my-cluster https://mycluster.local:6443 \
    --config-patch @patch.yaml

Or if you want to deploy Cilium without kube-proxy, you also need to disable kube proxy:

Create a patch.yaml file with the following contents:

cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true

talosctl gen config \
    my-cluster https://mycluster.local:6443 \
    --config-patch @patch.yaml

Installation using Cilium CLI

Note: It is recommended to template the cilium manifest using helm and use it as part of Talos machine config, but if you want to install Cilium using the Cilium CLI, you can follow the steps below.

Install the Cilium CLI following the steps here.

With kube-proxy

cilium install \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=false \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup

Without kube-proxy

cilium install \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=true \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set k8sServiceHost=localhost \
    --set k8sServicePort=7445

Or if you want to deploy Cilium with support for Gateway API (requires installing cilium without kube-proxy), install Gateway API CRDs and set some extra parameters:

cilium install \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=true \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set k8sServiceHost=localhost \
    --set k8sServicePort=7445 \
    --set gatewayAPI.enabled=true \
    --set gatewayAPI.enableAlpn=true \
    --set gatewayAPI.enableAppProtocol=true \

Note: If you plan to use gRPC and GRPCRoutes with TLS, you must enable ALPN by setting gatewayAPI.enableAlpn=true. Since gRPC relies on HTTP/2, ALPN is required to negotiate HTTP/2 support between the client and server.

Installation using Helm

Refer to Installing with Helm for more information.

First we’ll need to add the helm repo for Cilium.

helm repo add cilium https://helm.cilium.io/
helm repo update

Method 1: Helm install

After applying the machine config and bootstrapping Talos will appear to hang on phase 18/19 with the message: retrying error: node not ready. This happens because nodes in Kubernetes are only marked as ready once the CNI is up. As there is no CNI defined, the boot process is pending and will reboot the node to retry after 10 minutes, this is expected behavior.

During this window you can install Cilium manually by running the following:

helm install \
    cilium \
    cilium/cilium \
    --version 1.15.6 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=false \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup

Or if you want to deploy Cilium without kube-proxy, also set some extra parameters:

helm install \
    cilium \
    cilium/cilium \
    --version 1.15.6 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=true \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set k8sServiceHost=localhost \
    --set k8sServicePort=7445

And with GatewayAPI support:

...
    --set=gatewayAPI.enabled=true \
    --set=gatewayAPI.enableAlpn=true \
    --set=gatewayAPI.enableAppProtocol=true

After Cilium is installed the boot process should continue and complete successfully.

Method 2: Helm manifests install

Instead of directly installing Cilium you can instead first generate the manifest and then apply it:

helm template \
    cilium \
    cilium/cilium \
    --version 1.15.6 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=false \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup > cilium.yaml

kubectl apply -f cilium.yaml

Without kube-proxy:

helm template \
    cilium \
    cilium/cilium \
    --version 1.15.6 \
    --namespace kube-system \
    --set ipam.mode=kubernetes \
    --set kubeProxyReplacement=true \
    --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
    --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
    --set cgroup.autoMount.enabled=false \
    --set cgroup.hostRoot=/sys/fs/cgroup \
    --set k8sServiceHost=localhost \
    --set k8sServicePort=7445 > cilium.yaml

kubectl apply -f cilium.yaml

Method 3: Helm manifests hosted install

After generating cilium.yaml using helm template, instead of applying this manifest directly during the Talos boot window (before the reboot timeout). You can also host this file somewhere and patch the machine config to apply this manifest automatically during bootstrap. To do this patch your machine configuration to include this config instead of the above:

Create a patch.yaml file with the following contents:

cluster:
  network:
    cni:
      name: custom
      urls:
        - https://server.yourdomain.tld/some/path/cilium.yaml

talosctl gen config \
  my-cluster https://mycluster.local:6443 \
  --config-patch @patch.yaml

However, beware of the fact that the helm generated Cilium manifest contains sensitive key material. As such you should definitely not host this somewhere publicly accessible.

Method 4: Helm manifests inline install

A more secure option would be to include the helm template output manifest inside the machine configuration. The machine config should be generated with CNI set to none

Create a patch.yaml file with the following contents:

cluster:
  network:
    cni:
      name: none

talosctl gen config \
  my-cluster https://mycluster.local:6443 \
  --config-patch @patch.yaml

if deploying Cilium with kube-proxy disabled, you can also include the following:

Create a patch.yaml file with the following contents:

cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true

talosctl gen config \
  my-cluster https://mycluster.local:6443 \
  --config-patch @patch.yaml

To do so patch this into your machine configuration:

cluster:
  inlineManifests:
    - name: cilium
      contents: |
        --
        # Source: cilium/templates/cilium-agent/serviceaccount.yaml
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: "cilium"
          namespace: kube-system
        ---
        # Source: cilium/templates/cilium-operator/serviceaccount.yaml
        apiVersion: v1
        kind: ServiceAccount
        -> Your cilium.yaml file will be pretty long....

This will install the Cilium manifests at just the right time during bootstrap.

Beware though:

Changing the namespace when templating with Helm does not generate a manifest containing the yaml to create that namespace. As the inline manifest is processed from top to bottom make sure to manually put the namespace yaml at the start of the inline manifest.
Only add the Cilium inline manifest to the control plane nodes machine configuration.
Make sure all control plane nodes have an identical configuration.
If you delete any of the generated resources they will be restored whenever a control plane node reboots.
As a safety measure, Talos only creates missing resources from inline manifests, it never deletes or updates anything.
If you need to update a manifest make sure to first edit all control plane machine configurations and then run talosctl upgrade-k8s as it will take care of updating inline manifests.

Method 5: Using a job

We can utilize a job pattern run arbitrary logic during bootstrap time. We can leverage this to our advantage to install Cilium by using an inline manifest as shown in the example below:

cluster:
  inlineManifests:
    - name: cilium-install
      contents: |
        ---
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRoleBinding
        metadata:
          name: cilium-install
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: cluster-admin
        subjects:
        - kind: ServiceAccount
          name: cilium-install
          namespace: kube-system
        ---
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: cilium-install
          namespace: kube-system
        ---
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: cilium-install
          namespace: kube-system
        spec:
          backoffLimit: 10
          template:
            metadata:
              labels:
                app: cilium-install
            spec:
              restartPolicy: OnFailure
              tolerations:
                - operator: Exists
                - effect: NoSchedule
                  operator: Exists
                - effect: NoExecute
                  operator: Exists
                - effect: PreferNoSchedule
                  operator: Exists
                - key: node-role.kubernetes.io/control-plane
                  operator: Exists
                  effect: NoSchedule
                - key: node-role.kubernetes.io/control-plane
                  operator: Exists
                  effect: NoExecute
                - key: node-role.kubernetes.io/control-plane
                  operator: Exists
                  effect: PreferNoSchedule
              affinity:
                nodeAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    nodeSelectorTerms:
                      - matchExpressions:
                          - key: node-role.kubernetes.io/control-plane
                            operator: Exists
              serviceAccount: cilium-install
              serviceAccountName: cilium-install
              hostNetwork: true
              containers:
              - name: cilium-install
                image: quay.io/cilium/cilium-cli-ci:latest
                env:
                - name: KUBERNETES_SERVICE_HOST
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: KUBERNETES_SERVICE_PORT
                  value: "6443"
                command:
                  - cilium
                  - install
                  - --set
                  - ipam.mode=kubernetes
                  - --set
                  - kubeProxyReplacement=true
                  - --set
                  - securityContext.capabilities.ciliumAgent={CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}
                  - --set
                  - securityContext.capabilities.cleanCiliumState={NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}
                  - --set
                  - cgroup.autoMount.enabled=false
                  - --set
                  - cgroup.hostRoot=/sys/fs/cgroup
                  - --set
                  - k8sServiceHost=localhost
                  - --set
                  - k8sServicePort=7445

Because there is no CNI present at installation time the kubernetes.default.svc cannot be used to install Cilium, to overcome this limitation we’ll utilize the host network connection to connect back to itself with ‘hostNetwork: true’ in tandem with the environment variables KUBERNETES_SERVICE_PORT and KUBERNETES_SERVICE_HOST.

The job runs a container to install cilium to your liking, after the job is finished Cilium can be managed/operated like usual.

The above can be combined exchanged with for example Method 3 to host arbitrary configurations externally but render/run them at bootstrap time.

Known issues

There are some gotchas when using Talos and Cilium on the Google cloud platform when using internal load balancers. For more details: GCP ILB support / support scope local routes to be configured
When using Talos forwardKubeDNSToHost=true option (which is enabled by default) in combination with cilium bpf.masquerade=true. There is a known issue that causes CoreDNS to not work correctly. As a workaround, configuring forwardKubeDNSToHost=false resolves the issue. For more details see the discusssion here

Other things to know

After installing Cilium, cilium connectivity test might hang and/or fail with errors similar to
Error creating: pods "client-69748f45d8-9b9jg" is forbidden: violates PodSecurity "baseline:latest": non-default capabilities (container "client" must not include "NET_RAW" in securityContext.capabilities.add)
This is expected, you can workaround it by adding the pod-security.kubernetes.io/enforce=privileged label on the namespace level.
Talos has full kernel module support for eBPF, See:

2.2 - Multus CNI

A brief instruction on howto use Multus on Talos Linux

Multus CNI is a container network interface (CNI) plugin for Kubernetes that enables attaching multiple network interfaces to pods. Typically, in Kubernetes each pod only has one network interface (apart from a loopback) – with Multus you can create a multi-homed pod that has multiple interfaces. This is accomplished by Multus acting as a “meta-plugin”, a CNI plugin that can call multiple other CNI plugins.

Installation

Multus can be deployed by simply applying the thick DaemonSet with kubectl.

kubectl apply -f https://raw.githubusercontent.com/k8snetworkplumbingwg/multus-cni/master/deployments/multus-daemonset-thick.yml

This will create a DaemonSet and a CRD: NetworkAttachmentDefinition. This can be used to specify your network configuration.

Configuration

Patching the `DaemonSet`

For Multus to properly work with Talos a change need to be made to the DaemonSet. Instead of of mounting the volume called host-run-netns on /run/netns it has to be mounted on /var/run/netns.

Edit the DaemonSet and change the volume host-run-netns from /run/netns to /var/run/netns.

...
        - name: host-run-netns
          hostPath:
            path: /var/run/netns/

Failing to do so will leave your cluster crippled. Running pods will remain running but new pods and deployments will give you the following error in the events:

  Normal   Scheduled               3s    default-scheduler  Successfully assigned virtualmachines/samplepod to virt2
  Warning  FailedCreatePodSandBox  3s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3a6a58386dfbf2471a6f86bd41e4e9a32aac54ccccd1943742cb67d1e9c58b5b": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: 'ContainerID:"3a6a58386dfbf2471a6f86bd41e4e9a32aac54ccccd1943742cb67d1e9c58b5b" Netns:"/var/run/netns/cni-1d80f6e3-fdab-4505-eb83-7deb17431293" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=virtualmachines;K8S_POD_NAME=samplepod;K8S_POD_INFRA_CONTAINER_ID=3a6a58386dfbf2471a6f86bd41e4e9a32aac54ccccd1943742cb67d1e9c58b5b;K8S_POD_UID=8304765e-fd7e-4968-9144-c42c53be04f4" Path:"" ERRORED: error configuring pod [virtualmachines/samplepod] networking: [virtualmachines/samplepod/8304765e-fd7e-4968-9144-c42c53be04f4:cbr0]: error adding container to network "cbr0": DelegateAdd: cannot set "" interface name to "eth0": validateIfName: no net namespace /var/run/netns/cni-1d80f6e3-fdab-4505-eb83-7deb17431293 found: failed to Statfs "/var/run/netns/cni-1d80f6e3-fdab-4505-eb83-7deb17431293": no such file or directory
': StdinData: {"capabilities":{"portMappings":true},"clusterNetwork":"/host/etc/cni/net.d/10-flannel.conflist","cniVersion":"0.3.1","logLevel":"verbose","logToStderr":true,"name":"multus-cni-network","type":"multus-shim"}

As of March 21, 2025, Multus has a bug in the install-multus-binary container that can be lead to race problems after a node reboot. To prevent this issue, it is necessary to patch this container. Set the following command to the install-multus-binary container.

      initContainers:
        - name: install-multus-binary
          command:
            - "/usr/src/multus-cni/bin/install_multus"
            - "-d"
            - "/host/opt/cni/bin"
            - "-t"
            - "thick"

Creating your `NetworkAttachmentDefinition`

The NetworkAttachmentDefinition configuration is used to define your bridge where your second pod interface needs to be attached to.

apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.1.0/24",
        "rangeStart": "192.168.1.200",
        "rangeEnd": "192.168.1.216",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ],
        "gateway": "192.168.1.1"
      }
    }'

In this example macvlan is used as a bridge type. There are 3 types of bridges: bridge, macvlan and ipvlan:

bridge is a way to connect two Ethernet segments together in a protocol-independent way. Packets are forwarded based on Ethernet address, rather than IP address (like a router). Since forwarding is done at Layer 2, all protocols can go transparently through a bridge. In terms of containers or virtual machines, a bridge can also be used to connect the virtual interfaces of each container/VM to the host network, allowing them to communicate.
macvlan is a driver that makes it possible to create virtual network interfaces that appear as distinct physical devices each with unique MAC addresses. The underlying interface can route traffic to each of these virtual interfaces separately, as if they were separate physical devices. This means that each macvlan interface can have its own IP subnet and routing. Macvlan interfaces are ideal for situations where containers or virtual machines require the same network access as the host system.
ipvlan is similar to macvlan, with the key difference being that ipvlan shares the parent’s MAC address, which requires less configuration from the networking equipment. This makes deployments simpler in certain situations where MAC address control or limits are in place. It offers two operational modes: L2 mode (the default) where it behaves similarly to a MACVLAN, and L3 mode for routing based traffic isolation (rather than bridged).

When using the bridge interface you must also configure a bridge on your Talos nodes. That can be done by updating Talos Linux machine configuration:

machine:
    network:
      interfaces:
      - interface: br0
        addresses:
          - 172.16.1.60/24
        bridge:
          stp:
            enabled: true
          interfaces:
              - eno1 # This must be changed to your matching interface name
        routes:
            - network: 0.0.0.0/0 # The route's network (destination).
              gateway: 172.16.1.254 # The route's gateway (if empty, creates link scope route).
              metric: 1024 # The optional metric for the route.

More information about the configuration of bridges can be found here

Attaching the `NetworkAttachmentDefinition` to your `Pod` or `Deployment`

After the NetworkAttachmentDefinition is configured, you can attach that interface to your your Deployment or Pod. In this example we use a pod:

apiVersion: v1
kind: Pod
metadata:
  name: samplepod
  annotations:
    k8s.v1.cni.cncf.io/networks: macvlan-conf
spec:
  containers:
  - name: samplepod
    command: ["/bin/ash", "-c", "trap : TERM INT; sleep infinity & wait"]
    image: alpine

Notes on using KubeVirt in combination with Multus

If you would like to use KubeVirt and expose your virtual machine to the outside world with Multus, make sure to configure a bridge instead of macvlan or ipvlan, because that doesn’t work, according to the KubeVirt Documentation.

Invalid CNIs for secondary networks The following list of CNIs is known not to work for bridge interfaces - which are most common for secondary interfaces.
macvlan
ipvlan
The reason is similar: the bridge interface type moves the pod interface MAC address to the VM, leaving the pod interface with a different address. The aforementioned CNIs require the pod interface to have the original MAC address.

Notes on using Cilium in combination with Multus

CNI reference plugins

Cilium does not ship the CNI reference plugins, which most multus setups are expecting (e.g. macvlan). This can be addressed by extending the daemonset with an additional init-container, setting them up, e.g. using the following kustomize strategic-merge patch:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-multus-ds
  namespace: kube-system
spec:
  template:
    spec:
      initContainers:
      - command:
        - /install-cni.sh
        image: ghcr.io/siderolabs/install-cni:v1.7.0  # adapt to your talos version
        name: install-cni
        securityContext:
          privileged: true
        volumeMounts:
        - mountPath: /host/opt/cni/bin
          mountPropagation: Bidirectional
          name: cnibin

Exclusive CNI

By default, Cilium is an exclusive CNI, meaning it removes other CNI configuration files. However, when using Multus, this behavior needs to be disabled. To do so, set the Helm variable cni.exclusive=false. For more information, refer to the Cilium documentation.

Notes on ARM64 nodes

The official images (as of 29.07.24) are built incorrectly for ARM64 (ref). Self-building them is an adequate workaround for now.

3 - Upgrading Kubernetes

Guide on how to upgrade the Kubernetes cluster from Talos Linux.

This guide covers upgrading Kubernetes on Talos Linux clusters.

For a list of Kubernetes versions compatible with each Talos release, see the Support Matrix.

For upgrading the Talos Linux operating system, see Upgrading Talos

Video Walkthrough

To see a demo of this process, watch this video:

Automated Kubernetes Upgrade

The recommended method to upgrade Kubernetes is to use the talosctl upgrade-k8s command. This will automatically update the components needed to upgrade Kubernetes safely. Upgrading Kubernetes is non-disruptive to the cluster workloads.

To trigger a Kubernetes upgrade, issue a command specifying the version of Kubernetes to ugprade to, such as:

talosctl --nodes <controlplane node> upgrade-k8s --to 1.32.0

Note that the --nodes parameter specifies the control plane node to send the API call to, but all members of the cluster will be upgraded.

To check what will be upgraded you can run talosctl upgrade-k8s with the --dry-run flag:

$ talosctl --nodes <controlplane node> upgrade-k8s --to 1.32.0 --dry-run
WARNING: found resources which are going to be deprecated/migrated in the version 1.32.0
RESOURCE                                                               COUNT
validatingwebhookconfigurations.v1beta1.admissionregistration.k8s.io   4
mutatingwebhookconfigurations.v1beta1.admissionregistration.k8s.io     3
customresourcedefinitions.v1beta1.apiextensions.k8s.io                 25
apiservices.v1beta1.apiregistration.k8s.io                             54
leases.v1beta1.coordination.k8s.io                                     4
automatically detected the lowest Kubernetes version 1.31.1
checking for resource APIs to be deprecated in version 1.32.0
discovered controlplane nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
discovered worker nodes ["172.20.0.5" "172.20.0.6"]
updating "kube-apiserver" to version "1.32.0"
 > "172.20.0.2": starting update
 > update kube-apiserver: v1.31.1 -> 1.32.0
 > skipped in dry-run
 > "172.20.0.3": starting update
 > update kube-apiserver: v1.31.1 -> 1.32.0
 > skipped in dry-run
 > "172.20.0.4": starting update
 > update kube-apiserver: v1.31.1 -> 1.32.0
 > skipped in dry-run
updating "kube-controller-manager" to version "1.32.0"
 > "172.20.0.2": starting update
 > update kube-controller-manager: v1.31.1 -> 1.32.0
 > skipped in dry-run
 > "172.20.0.3": starting update

<snip>

updating manifests
 > apply manifest Secret bootstrap-token-3lb63t
 > apply skipped in dry run
 > apply manifest ClusterRoleBinding system-bootstrap-approve-node-client-csr
 > apply skipped in dry run
<snip>

To upgrade Kubernetes from v1.31.1 to v1.32.0 run:

$ talosctl --nodes <controlplane node> upgrade-k8s --to 1.32.0
automatically detected the lowest Kubernetes version 1.31.1
checking for resource APIs to be deprecated in version 1.32.0
discovered controlplane nodes ["172.20.0.2" "172.20.0.3" "172.20.0.4"]
discovered worker nodes ["172.20.0.5" "172.20.0.6"]
updating "kube-apiserver" to version "1.32.0"
 > "172.20.0.2": starting update
 > update kube-apiserver: v1.31.1 -> 1.32.0
 > "172.20.0.2": machine configuration patched
 > "172.20.0.2": waiting for API server state pod update
 < "172.20.0.2": successfully updated
 > "172.20.0.3": starting update
 > update kube-apiserver: v1.31.1 -> 1.32.0
<snip>

This command runs in several phases:

Images for new Kubernetes components are pre-pulled to the nodes to minimize downtime and test for image availability.
Every control plane node machine configuration is patched with the new image version for each control plane component. Talos renders new static pod definitions on the configuration update which is picked up by the kubelet. The command waits for the change to propagate to the API server state.
The command updates the kube-proxy daemonset with the new image version.
On every node in the cluster, the kubelet version is updated. The command then waits for the kubelet service to be restarted and become healthy. The update is verified by checking the Node resource state.
Kubernetes bootstrap manifests are re-applied to the cluster. Updated bootstrap manifests might come with a new Talos version (e.g. CoreDNS version update), or might be the result of machine configuration change.

Note: The upgrade-k8s command never deletes any resources from the cluster: they should be deleted manually.

If the command fails for any reason, it can be safely restarted to continue the upgrade process from the moment of the failure.

Note: When using custom/overridden Kubernetes component images, use flags --*-image to override the default image names.

Manual Kubernetes Upgrade

Kubernetes can be upgraded manually by following the steps outlined below. They are equivalent to the steps performed by the talosctl upgrade-k8s command.

Kubeconfig

In order to edit the control plane, you need a working kubectl config. If you don’t already have one, you can get one by running:

talosctl --nodes <controlplane node> kubeconfig

API Server

Patch machine configuration using talosctl patch command:

$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/apiServer/image", "value": "registry.k8s.io/kube-apiserver:v1.32.0"}]'
patched mc at the node 172.20.0.2

The JSON patch might need to be adjusted if current machine configuration is missing .cluster.apiServer.image key.

Also the machine configuration can be edited manually with talosctl -n <IP> edit mc --mode=no-reboot.

Capture the new version of kube-apiserver config with:

$ talosctl -n <CONTROL_PLANE_IP_1> get apiserverconfig -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: APIServerConfigs.kubernetes.talos.dev
    id: kube-apiserver
    version: 5
    owner: k8s.ControlPlaneAPIServerController
    phase: running
spec:
    image: registry.k8s.io/kube-apiserver:v1.32.0
    cloudProvider: ""
    controlPlaneEndpoint: https://172.20.0.1:6443
    etcdServers:
        - https://localhost:2379
    localPort: 6443
    serviceCIDR:
        - 10.96.0.0/12
    extraArgs: {}
    extraVolumes: []
    environmentVariables: {}
    podSecurityPolicyEnabled: false
    advertisedAddress: $(POD_IP)
    resources:
        requests:
            cpu: ""
            memory: ""
        limits: {}

In this example, the new version is 5. Wait for the new pod definition to propagate to the API server state (replace talos-default-controlplane-1 with the node name):

$ kubectl get pod -n kube-system -l k8s-app=kube-apiserver --field-selector spec.nodeName=talos-default-controlplane-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'
5

Check that the pod is running:

$ kubectl get pod -n kube-system -l k8s-app=kube-apiserver --field-selector spec.nodeName=talos-default-controlplane-1
NAME                                    READY   STATUS    RESTARTS   AGE
kube-apiserver-talos-default-controlplane-1   1/1     Running   0          16m

Repeat this process for every control plane node, verifying that state got propagated successfully between each node update.

Controller Manager

Patch machine configuration using talosctl patch command:

$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/controllerManager/image", "value": "registry.k8s.io/kube-controller-manager:v1.32.0"}]'
patched mc at the node 172.20.0.2

The JSON patch might need be adjusted if current machine configuration is missing .cluster.controllerManager.image key.

Capture new version of kube-controller-manager config with:

$ talosctl -n <CONTROL_PLANE_IP_1> get kcpc controllermanagerconfig -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: ControllerManagerConfigs.kubernetes.talos.dev
    id: kube-controller-manager
    version: 3
    owner: k8s.ControlPlaneControllerManagerController
    phase: running
spec:
    enabled: true
    image: registry.k8s.io/kube-controller-manager:v1.32.0
    cloudProvider: ""
    podCIDRs:
        - 10.244.0.0/16
    serviceCIDRs:
        - 10.96.0.0/12
    extraArgs: {}
    extraVolumes: []
    environmentVariables: {}
    resources:
        requests:
            cpu: ""
            memory: ""
        limits: {}

In this example, new version is 3. Wait for the new pod definition to propagate to the API server state (replace talos-default-controlplane-1 with the node name):

$ kubectl get pod -n kube-system -l k8s-app=kube-controller-manager --field-selector spec.nodeName=talos-default-controlplane-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'
3

Check that the pod is running:

$ kubectl get pod -n kube-system -l k8s-app=kube-controller-manager --field-selector spec.nodeName=talos-default-controlplane-1
NAME                                             READY   STATUS    RESTARTS   AGE
kube-controller-manager-talos-default-controlplane-1   1/1     Running   0          35m

Repeat this process for every control plane node, verifying that state propagated successfully between each node update.

Scheduler

Patch machine configuration using talosctl patch command:

$ talosctl -n <CONTROL_PLANE_IP_1> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/cluster/scheduler/image", "value": "registry.k8s.io/kube-scheduler:v1.32.0"}]'
patched mc at the node 172.20.0.2

JSON patch might need be adjusted if current machine configuration is missing .cluster.scheduler.image key.

Capture new version of kube-scheduler config with:

$ talosctl -n <CONTROL_PLANE_IP_1> get schedulerconfig -o yaml
node: 172.20.0.2
metadata:
    namespace: controlplane
    type: SchedulerConfigs.kubernetes.talos.dev
    id: kube-scheduler
    version: 3
    owner: k8s.ControlPlaneSchedulerController
    phase: running
    created: 2024-11-06T12:37:22Z
    updated: 2024-11-06T12:37:20Z
spec:
    enabled: true
    image: registry.k8s.io/kube-scheduler:v1.32.0
    extraArgs: {}
    extraVolumes: []
    environmentVariables: {}
    resources:
        requests:
            cpu: ""
            memory: ""
        limits: {}
    config: {}

In this example, new version is 3. Wait for the new pod definition to propagate to the API server state (replace talos-default-controlplane-1 with the node name):

$ kubectl get pod -n kube-system -l k8s-app=kube-scheduler --field-selector spec.nodeName=talos-default-controlplane-1 -o jsonpath='{.items[0].metadata.annotations.talos\.dev/config\-version}'
3

Check that the pod is running:

$ kubectl get pod -n kube-system -l k8s-app=kube-scheduler --field-selector spec.nodeName=talos-default-controlplane-1
NAME                                    READY   STATUS    RESTARTS   AGE
kube-scheduler-talos-default-controlplane-1   1/1     Running   0          39m

Repeat this process for every control plane node, verifying that state got propagated successfully between each node update.

Proxy

In the proxy’s DaemonSet, change:

kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-proxy
          image: registry.k8s.io/kube-proxy:v1.31.1
      tolerations:
        - ...

to:

kind: DaemonSet
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: kube-proxy
          image: registry.k8s.io/kube-proxy:v1.32.0
      tolerations:
        - ...
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule

To edit the DaemonSet, run:

kubectl edit daemonsets -n kube-system kube-proxy

Bootstrap Manifests

Bootstrap manifests can be retrieved in a format which works for kubectl with the following command:

talosctl -n <controlplane IP> get manifests -o yaml | yq eval-all '.spec | .[] | splitDoc' - > manifests.yaml

Diff the manifests with the cluster:

kubectl diff -f manifests.yaml

Apply the manifests:

kubectl apply -f manifests.yaml

Note: if some bootstrap resources were removed, they have to be removed from the cluster manually.

kubelet

For every node, patch machine configuration with new kubelet version, wait for the kubelet to restart with new version:

$ talosctl -n <IP> patch mc --mode=no-reboot -p '[{"op": "replace", "path": "/machine/kubelet/image", "value": "ghcr.io/siderolabs/kubelet:v1.32.0"}]'
patched mc at the node 172.20.0.2

Once kubelet restarts with the new configuration, confirm upgrade with kubectl get nodes <name>:

$ kubectl get nodes talos-default-controlplane-1
NAME                           STATUS   ROLES                  AGE    VERSION
talos-default-controlplane-1   Ready    control-plane          123m   v1.32.0

Kubernetes Guides

1 - Configuration

1.1 - Ceph Storage cluster with Rook

Preparation

Installation

Talos Linux Considerations

Cleaning Up

Rook Ceph Cluster Removal

Talos Linux Rook Metadata Removal

1.2 - Deploying Metrics Server

Node Configuration

Install During Bootstrap

Install After Bootstrap

1.3 - Device Plugins

Deploying the Device Plugin

Deploying a Pod with the Device

1.4 - iSCSI Storage with Synology CSI

Background

Prerequisites

Setting up the Synology user account

Setting up the Synology CSI

Configure connection info

Build the Talos-compatible image

Configure the CSI driver

Apply YAML manifests

Run performance tests

1.5 - KubePrism

Video Walkthrough

Enabling KubePrism

How it works

Notes

1.6 - Local Storage

hostPath mounts

Local Path Provisioner

1.7 - Pod Security

Configuration

Usage

1.8 - Seccomp Profiles

Preparing the nodes

Create a Kubernetes workload that uses the custom Seccomp Profile

Cleanup

1.9 - Storage

Public Cloud

Storage Clusters

Rook/Ceph

OpenEBS Mayastor replicated storage

Video Walkthrough

Prep Nodes

Deploy Mayastor

Piraeus / LINSTOR

Install Piraeus Operator V2

Create first storage pool and PVC

NFS

Object storage

Others (iSCSI)

1.10 - User Namespaces

Enabling Usernamespaces

2 - Network

2.1 - Deploying Cilium CNI

Machine config preparation

Installation using Cilium CLI

With kube-proxy

Without kube-proxy

Installation using Helm

Method 1: Helm install

Method 2: Helm manifests install

Method 3: Helm manifests hosted install

Method 4: Helm manifests inline install

Method 5: Using a job

Known issues

Other things to know

2.2 - Multus CNI

Installation

Configuration

Patching the DaemonSet

Creating your NetworkAttachmentDefinition

Attaching the NetworkAttachmentDefinition to your Pod or Deployment

Notes on using KubeVirt in combination with Multus

Notes on using Cilium in combination with Multus

CNI reference plugins

`hostPath` mounts

Patching the `DaemonSet`

Creating your `NetworkAttachmentDefinition`

Attaching the `NetworkAttachmentDefinition` to your `Pod` or `Deployment`