nvidia-smi is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
This functionality is accessible programmatically through public APIs and interactively through CLI tools. It is designed to be run either as a standalone entity or as an embedded library within management tools. This document provides an overview of DCGM’s main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of NVIDIA GPUs.
Installation
Assuming you are using a RHEL derivative such as Rocky Linux 8, installation is straightforward.
Step 3: Install the Kernel-Headers and Kernel-Devel
The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt.
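A minimal sketch of this step as an Ansible task. The task name is illustrative and not taken from the original playbook. It assumes facts are gathered, so that `ansible_kernel` holds the output of `uname -r`, and that packages matching the running kernel are still available in the enabled repositories.

```yaml
# Hypothetical task: install the headers and devel packages that match
# the running kernel (ansible_kernel is the value of `uname -r`).
- name: Install kernel-headers and kernel-devel for the running kernel
  ansible.builtin.dnf:
    name:
      - "kernel-headers-{{ ansible_kernel }}"
      - "kernel-devel-{{ ansible_kernel }}"
    state: present
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
```

If the running kernel is older than what the repositories carry, the versioned package names may not resolve. Updating the kernel and rebooting first is one way around that.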
To install the display driver, the Nouveau driver must first be disabled. I use an Ansible template for this, called blacklist-nouveau-conf.j2, with the following content:
blacklist nouveau
options nouveau modeset=0
The Ansible script for disabling Nouveau using a template
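A sketch of what such a task could look like, assuming the template sits in the role's templates directory. The destination file name, the registered variable `blacklist_nouveau`, and the initramfs rebuild are assumptions, not taken from the original playbook.

```yaml
# Assumed destination: modprobe reads any *.conf file under /etc/modprobe.d/.
- name: Disable Nouveau with a modprobe blacklist file
  ansible.builtin.template:
    src: blacklist-nouveau-conf.j2
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: "0644"
  register: blacklist_nouveau  # hypothetical variable name

# Assumption: Nouveau is included in the initramfs, so the image is
# rebuilt for the blacklist to apply at early boot.
- name: Regenerate the initramfs after changing the blacklist
  ansible.builtin.command: dracut --force
  when: blacklist_nouveau.changed
```

The blacklist only takes effect after the next reboot, so this should run before the reboot step.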
Step 6: Reboot if there are changes to Drivers and CUDA
- name: Reboot if there are changes to Drivers or CUDA
  ansible.builtin.reboot:
  when:
    - install_driver.changed or install_cuda.changed
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
Aftermath
After the reboot, run nvidia-smi; if all went well, it should list your GPUs. If the driver installation fails instead, dnf may report dependency errors like these:
Error:
Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
The hint is that dkms is required, but nothing provides it:
nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
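One way to satisfy this dependency, sketched as Ansible tasks. It assumes dkms comes from the EPEL repository and that `epel-release` is installable from the distribution's default repositories, as it is on Rocky Linux 8. The task names are illustrative.

```yaml
# Assumption: dkms is not in the base repositories and is taken from EPEL.
- name: Enable the EPEL repository
  ansible.builtin.dnf:
    name: epel-release
    state: present

- name: Install dkms so the kmod-nvidia-*-dkms packages can resolve it
  ansible.builtin.dnf:
    name: dkms
    state: present
```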
After installing dkms, retrying the installation gives:
Last metadata expiration check: 0:01:01 ago on Mon 06 Jun 2022 08:47:40 PM EDT.
Error:
Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
- package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
- package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
- cannot install the best candidate for the job
- package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
- package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
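You will notice that the dkms issue has been resolved, but the kmod packages are still filtered out by modular filtering. Try not using the nvidia-driver:latest stream; install the module without naming a stream: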
# dnf module install nvidia-driver
===================================================================================================================================================
Package Architecture Version Repository Size
===================================================================================================================================================
Upgrading:
bcc x86_64 0.19.0-5.el8 appstream 674 k
bcc-tools x86_64 0.19.0-5.el8 appstream 447 k
bpftrace x86_64 0.12.1-4.el8 appstream 1.3 M
clang-libs x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 23 M
clang-resource-filesystem x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 13 k
compiler-rt x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 4.2 M
libglvnd x86_64 1:1.3.4-1.el8 appstream 126 k
libglvnd-egl x86_64 1:1.3.4-1.el8 appstream 48 k
libglvnd-gles x86_64 1:1.3.4-1.el8 appstream 39 k
libglvnd-glx x86_64 1:1.3.4-1.el8 appstream 136 k
libomp-devel x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 28 k
llvm-libs x86_64 13.0.1-1.module+el8.6.0+825+7e27476a appstream 24 M
mesa-dri-drivers x86_64 21.3.4-1.el8 appstream 11 M
mesa-filesystem x86_64 21.3.4-1.el8 appstream 33 k
mesa-libxatracker x86_64 21.3.4-1.el8 appstream 2.0 M
python3-bcc x86_64 0.19.0-5.el8 appstream 89 k
Installing group/module packages:
cuda-drivers x86_64 515.48.07-1 cuda-rhel8-x86_64 8.1 k
kmod-nvidia-latest-dkms x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 30 M
nvidia-driver x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 23 M
nvidia-driver-NVML x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 462 k
nvidia-driver-NvFBCOpenGL x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 54 k
nvidia-driver-cuda x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 455 k
nvidia-driver-cuda-libs x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 54 M
nvidia-driver-devel x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 13 k
nvidia-driver-libs x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 177 M
nvidia-kmod-common noarch 3:515.48.07-1.el8 cuda-rhel8-x86_64 13 k
nvidia-libXNVCtrl x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 26 k
nvidia-libXNVCtrl-devel x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 56 k
nvidia-modprobe x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 37 k
nvidia-persistenced x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 43 k
nvidia-settings x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 835 k
nvidia-xconfig x86_64 3:515.48.07-1.el8 cuda-rhel8-x86_64 106 k
Installing dependencies:
dnf-plugin-nvidia noarch 2.0-1.el8 cuda-rhel8-x86_64 12 k
egl-wayland x86_64 1.1.9-3.el8 appstream 39 k
libX11-devel x86_64 1.6.8-5.el8 appstream 975 k
libXau-devel x86_64 1.0.9-3.el8 appstream 19 k
libglvnd-opengl x86_64 1:1.3.4-1.el8 appstream 46 k
libvdpau x86_64 1.4-2.el8 appstream 40 k
libxcb-devel x86_64 1.13.1-1.el8 appstream 1.1 M
mesa-vulkan-drivers x86_64 21.3.4-1.el8 appstream 6.7 M
ocl-icd x86_64 2.2.12-1.el8 appstream 50 k
opencl-filesystem noarch 1.0-6.el8 appstream 7.3 k
vulkan-loader x86_64 1.3.204.0-2.el8 appstream 133 k
xorg-x11-proto-devel noarch 2020.1-3.el8 appstream 279 k
Installing module profiles:
nvidia-driver/default
Enabling module streams:
nvidia-driver latest-dkms
.....
.....
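The same installation can be sketched as an Ansible task. Two things are assumptions here: the stream and profile are spelled out to match what dnf selected in the output above (latest-dkms, default), and the result is registered as `install_driver`, the name the reboot condition refers to.

```yaml
# "@name:stream/profile" is the module spec syntax of ansible.builtin.dnf.
- name: Install the nvidia-driver module (latest-dkms stream, default profile)
  ansible.builtin.dnf:
    name: "@nvidia-driver:latest-dkms/default"
    state: present
  register: install_driver
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
```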
Finally, run nvidia-smi to confirm that the driver is loaded:
# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:A3:00.0 Off |                    0 |
| N/A   49C    P0    46W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:C3:00.0 Off |                    0 |
| N/A   53C    P0    46W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
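If you want the playbook itself to perform this check, a minimal sketch follows. The task names and the `nvidia_smi` variable are illustrative. The first task fails when nvidia-smi exits with a non-zero status, which is what happens when it cannot reach the driver.

```yaml
# Read-only check, so it is never reported as a change.
- name: Check that nvidia-smi can talk to the driver
  ansible.builtin.command: nvidia-smi
  register: nvidia_smi  # hypothetical variable name
  changed_when: false

- name: Show the nvidia-smi output
  ansible.builtin.debug:
    var: nvidia_smi.stdout_lines
```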
If, after a reboot, you run nvidia-smi and receive an error like this one:
# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.