NVIDIA Fabric Manager
In this guide we’ll follow the procedure to enable NVIDIA Fabric Manager.
NVIDIA GPUs with NVLink support (for example, the A100) need the nvidia-fabricmanager system extension enabled in addition to the NVIDIA drivers. For more information on Fabric Manager, refer to https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html
The published versions of the NVIDIA fabricmanager system extension are available here
The nvidia-fabricmanager extension version has to match the NVIDIA driver version in use.
Upgrading Talos and enabling the NVIDIA fabricmanager system extension
In addition to the patch defined in the NVIDIA drivers guide, we need to add the nvidia-fabricmanager system extension to the patch yaml gpu-worker-patch.yaml:
- op: add
  path: /machine/install/extensions
  value:
    - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:535.54.03-v1.5.5
    - image: ghcr.io/siderolabs/nvidia-container-toolkit:535.54.03-v1.13.5
    - image: ghcr.io/siderolabs/nvidia-fabricmanager:525.85.12
- op: add
  path: /machine/kernel
  value:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
- op: add
  path: /machine/sysctls
  value:
    net.core.bpf_jit_harden: 1
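Because the nvidia-fabricmanager version has to match the NVIDIA driver version, it can help to compare the image tags before applying the patch. The following is a minimal sketch, not part of Talos or its tooling: the helper names driver_version and versions_match are hypothetical, and it assumes that each tag begins with the NVIDIA version as three dot-separated numbers, as the tags in this guide do.

```python
import re

# Assumption: the tag of each NVIDIA-related extension image starts with the
# NVIDIA version (three dot-separated numbers), e.g. "535.54.03-v1.5.5".
_VERSION_RE = re.compile(r"^(\d+\.\d+\.\d+)")


def driver_version(image: str) -> str:
    """Return the leading NVIDIA version found in the tag of an image reference."""
    # Everything after the last ":" is treated as the tag.
    _, sep, tag = image.rpartition(":")
    match = _VERSION_RE.match(tag) if sep else None
    if match is None:
        raise ValueError(f"no NVIDIA version found in tag of {image!r}")
    return match.group(1)


def versions_match(driver_image: str, fabricmanager_image: str) -> bool:
    """Check whether both images carry the same NVIDIA version in their tags."""
    return driver_version(driver_image) == driver_version(fabricmanager_image)


if __name__ == "__main__":
    driver = "ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:535.54.03-v1.5.5"
    fabric = "ghcr.io/siderolabs/nvidia-fabricmanager:525.85.12"
    print(driver_version(driver), driver_version(fabric), versions_match(driver, fabric))
```

Run against the tags shown in the patch above, the check reports a mismatch (535.54.03 for the driver and 525.85.12 for fabricmanager), so in practice the fabricmanager tag would be set to a published version that matches the driver in use.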
Last modified April 27, 2023: chore: fork docs and compatibility modules for Talos 1.5 (d9bdea2b5)