
NVIDIA GPU (OSS drivers)

In this guide we’ll follow the procedure to enable NVIDIA GPU support on Talos using the open-source (OSS) NVIDIA drivers.

Enabling NVIDIA GPU support on Talos is bound by the NVIDIA EULA. The Talos-published NVIDIA OSS drivers are built for a specific Talos release, so the extension versions also need to be updated when upgrading Talos.

The published versions of the NVIDIA OSS system extensions can be found in the siderolabs/extensions repository (ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules and ghcr.io/siderolabs/nvidia-container-toolkit).

Upgrading Talos and enabling the NVIDIA OSS modules and the system extension

Make sure to use talosctl version v1.5.5 or later
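If needed, you can check which client version you have installed with (a minimal check, assuming talosctl is already on your PATH):

# prints only the talosctl client version
talosctl version --client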

First, create a patch file gpu-worker-patch.yaml to update the machine config, similar to the one below:

- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:535.54.03-v1.5.5
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:535.54.03-v1.13.5
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1

Update the driver version and Talos release in the above patch to the latest published versions if newer ones are available. Make sure the driver version matches for both the nvidia-open-gpu-kernel-modules and nvidia-container-toolkit extensions. The nvidia-open-gpu-kernel-modules extension is versioned as <nvidia-driver-version>-<talos-release-version>, and the nvidia-container-toolkit extension is versioned as <nvidia-driver-version>-<nvidia-container-toolkit-version>.
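For example, in the patch above nvidia-open-gpu-kernel-modules:535.54.03-v1.5.5 means NVIDIA driver 535.54.03 built against Talos v1.5.5, while nvidia-container-toolkit:535.54.03-v1.13.5 means NVIDIA driver 535.54.03 packaged with NVIDIA Container Toolkit v1.13.5.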

Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:

talosctl patch mc --patch @gpu-worker-patch.yaml
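By default this targets whatever nodes your talosconfig points at; to patch a specific GPU worker explicitly, you can pass --nodes (the IP below is only an example):

# replace 172.31.41.27 with the IP of your GPU worker node
talosctl patch mc --nodes 172.31.41.27 --patch @gpu-worker-patch.yaml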

Now we can proceed to upgrade Talos to the matching release to enable the system extension:

talosctl upgrade --image=ghcr.io/siderolabs/installer:v1.5.5
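To confirm the node came back on the expected Talos release after the upgrade, you can run (the node IP is an example):

# replace 172.31.41.27 with the IP of your GPU worker node
talosctl version --nodes 172.31.41.27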

Once the node reboots, the NVIDIA modules should be loaded and the system extension should be installed.

This can be confirmed by running:

talosctl read /proc/modules

which should produce an output similar to below:

nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
talosctl get extensions

which should produce an output similar to below:

NODE           NAMESPACE   TYPE              ID                                                                       VERSION   NAME                             VERSION
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-container-toolkit-515.65.01-v1.10.0        1         nvidia-container-toolkit         515.65.01-v1.10.0
172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-siderolabs-nvidia-open-gpu-kernel-modules-515.65.01-v1.2.0   1         nvidia-open-gpu-kernel-modules   515.65.01-v1.2.0
talosctl read /proc/driver/nvidia/version

which should produce an output similar to below:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  515.65.01  Wed Mar 16 11:24:05 UTC 2022
GCC version:  gcc version 12.2.0 (GCC)

Deploying NVIDIA device plugin

First we need to create the RuntimeClass.

Apply the following manifest to create a runtime class that uses the extension:

---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
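Assuming you save the manifest above to a file (the filename below is arbitrary), it can be applied with:

# nvidia-runtimeclass.yaml is whatever filename you saved the manifest as
kubectl apply -f nvidia-runtimeclass.yaml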

Install the NVIDIA device plugin:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.13.0 --set=runtimeClassName=nvidia
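Once the device plugin pods are running, the GPU nodes should advertise nvidia.com/gpu as an allocatable resource. One way to spot-check this is:

# look for nvidia.com/gpu under Capacity / Allocatable
kubectl describe nodes | grep nvidia.com/gpu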

(Optional) Setting the default runtime class to nvidia

Do note that this will set the default runtime class to nvidia for all pods scheduled on the node.

Create a patch file nvidia-default-runtimeclass.yaml to update the machine config, similar to the one below:

- op: add
  path: /machine/files
  value:
    - content: |
        [plugins]
          [plugins."io.containerd.grpc.v1.cri"]
            [plugins."io.containerd.grpc.v1.cri".containerd]
              default_runtime_name = "nvidia"
      path: /etc/cri/conf.d/20-customization.part
      op: create

Now apply the patch to all Talos nodes in the cluster that have NVIDIA GPUs installed:

talosctl patch mc --patch @nvidia-default-runtimeclass.yaml
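Once the machine config change is applied, you can verify that the file from the patch was created on the node (the path comes from the patch above):

talosctl read /etc/cri/conf.d/20-customization.part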

Testing the runtime class

Note that spec.runtimeClassName is explicitly set to nvidia in the pod spec.

Run the following command to test the runtime class:

kubectl run \
  nvidia-test \
  --restart=Never \
  -ti --rm \
  --image nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 \
  --overrides '{"spec": {"runtimeClassName": "nvidia"}}' \
  nvidia-smi
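As an alternative to kubectl run, a roughly equivalent standalone Pod manifest might look like the following sketch; the pod and container names are arbitrary, and the GPU resource request assumes the NVIDIA device plugin installed above is running on the node:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: nvidia-test
      image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          # requires the NVIDIA device plugin to be advertising nvidia.com/gpu
          nvidia.com/gpu: 1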