The forum post found here helped me with the RHEL8 repo. I was using Rocky Linux 8.10.
dnf install cuda
Last metadata expiration check: 2:34:35 ago on Tue 14 Jan 2025 08:28:15 AM +08.
Error:
Problem: package cuda-12.6.3-1.x86_64 from cuda-rhel8-x86_64 requires nvidia-open >= 560.35.05, but none of the providers can be installed
- cannot install the best candidate for the job
- package nvidia-open-3:560.28.03-1.noarch from cuda-rhel8-x86_64 is filtered out by modular filtering
- package nvidia-open-3:560.35.03-1.noarch from cuda-rhel8-x86_64 is filtered out by modular filtering
- package nvidia-open-3:560.35.05-1.el8.noarch from cuda-rhel8-x86_64 is filtered out by modular filtering
- package nvidia-open-3:565.57.01-1.el8.noarch from cuda-rhel8-x86_64 is filtered out by modular filtering
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
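The "filtered out by modular filtering" messages usually mean the packages are tied to a DNF module stream that is not currently enabled. A minimal Ansible sketch of the workaround, assuming the nvidia-open packages are delivered through the nvidia-driver:open-dkms stream of the cuda-rhel8-x86_64 repository (check with dnf module list nvidia-driver on your host):

- name: Enable the open kernel module driver stream
  ansible.builtin.command:
    cmd: dnf -y module enable nvidia-driver:open-dkms
  changed_when: false   # idempotence detection omitted in this sketch

- name: Install CUDA once the nvidia-open packages are visible
  ansible.builtin.dnf:
    name: cuda
    state: present
  register: install_cuda   # same register name as the reboot condition used later

The goal is to make the nvidia-open packages visible to the resolver so that dnf install cuda no longer needs --skip-broken or --nobest.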
You may encounter a slow nvidia-smi, where it takes a while before the information is shown. For my 8 x A40 cards, it took about 26 seconds to initialise.
The slow initialization might be due to the driver persistence issue. For more background on the issue, do take a look at NVIDIA Driver Persistence. According to the article:
The NVIDIA GPU driver has historically followed Unix design philosophies by only initializing software and hardware state when the user has configured the system to do so. Traditionally, this configuration was done via the X Server and the GPUs were only initialized when the X Server (on behalf of the user) requested that they be enabled. This is very important for the ability to reconfigure the GPUs without a reboot (for example, changing SLI mode or bus settings, especially in the AGP days).
More recently, this has proven to be a problem within compute-only environments, where X is not used and the GPUs are accessed via transient instantiations of the CUDA library. This results in the GPU state being initialized and deinitialized more often than the user truly wants and leads to long load times for each CUDA job, on the order of seconds.
NVIDIA previously provided Persistence Mode to solve this issue. This is a kernel-level solution that can be configured using nvidia-smi. This approach would prevent the kernel module from fully unloading software and hardware state when no user software was using the GPU. However, this approach creates subtle interaction problems with the rest of the system that have made maintenance difficult.
The purpose of the NVIDIA Persistence Daemon is to replace this kernel-level solution with a more robust user-space solution. This enables compute-only environments to more closely resemble the historically typical graphics environments that the NVIDIA GPU driver was designed around.
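In practice, the fix on a compute node is simply to keep the persistence daemon running. A minimal Ansible sketch, assuming the driver packages ship the usual nvidia-persistenced systemd unit:

- name: Keep the NVIDIA driver state resident between CUDA jobs
  ansible.builtin.systemd:
    name: nvidia-persistenced
    enabled: true
    state: started

With the daemon running, the driver no longer tears down and reinitialises GPU state between invocations, which is what causes the long nvidia-smi start-up described above.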
nvidia-smi is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.
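As a quick illustration of the kind of information it exposes, the sketch below queries a few standard --query-gpu fields from an Ansible play (the field list is just an example):

- name: Query basic GPU facts with nvidia-smi
  ansible.builtin.command:
    cmd: nvidia-smi --query-gpu=name,driver_version,persistence_mode,memory.total --format=csv
  register: gpu_facts
  changed_when: false   # read-only query, never reports a change

- name: Show the result
  ansible.builtin.debug:
    var: gpu_facts.stdout_lines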
The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:
GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring
This functionality is accessible programmatically through public APIs and interactively through CLI tools. It is designed to be run either as a standalone entity or as an embedded library within management tools. This document is intended as an overview of DCGM’s main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of NVIDIA GPUs.
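If you want DCGM alongside the driver, a hedged Ansible sketch, assuming the datacenter-gpu-manager package and the nvidia-dcgm service come from the same CUDA repository:

- name: Install the NVIDIA Data Center GPU Manager
  ansible.builtin.dnf:
    name: datacenter-gpu-manager
    state: present

- name: Start the DCGM host engine
  ansible.builtin.systemd:
    name: nvidia-dcgm
    enabled: true
    state: started

Once the host engine is up, dcgmi discovery -l should list the GPUs it is managing.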
Installation
Assuming you are using a RHEL derivative like Rocky Linux 8, installation is a breeze.
Step 3: Install the Kernel-Headers and Kernel-Devel
The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt.
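A minimal Ansible sketch for this step, using the ansible_kernel fact so the packages match the running kernel:

- name: Install kernel headers and development packages for the running kernel
  ansible.builtin.dnf:
    name:
      - "kernel-headers-{{ ansible_kernel }}"
      - "kernel-devel-{{ ansible_kernel }}"
    state: present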
To install the Display Driver, the Nouveau driver must first be disabled. I use a template to do this, called blacklist-nouveau-conf.j2. Here is the content:
blacklist nouveau
options nouveau modeset=0
The Ansible task for disabling Nouveau uses this template.
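A minimal sketch of such a task, assuming the rendered file lands in /etc/modprobe.d/blacklist-nouveau.conf and the initramfs is rebuilt so Nouveau stays out of early boot:

- name: Disable Nouveau via the blacklist template
  ansible.builtin.template:
    src: blacklist-nouveau-conf.j2
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: "0644"
  register: blacklist_nouveau

- name: Rebuild the initramfs when the blacklist changes
  ansible.builtin.command:
    cmd: dracut --force
  when: blacklist_nouveau.changed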
Step 6: Reboot if there are changes to Drivers and CUDA
- name: Reboot if there are changes to Drivers or CUDA
  ansible.builtin.reboot:
  when:
    - install_driver.changed or install_cuda.changed
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
Aftermath
After the reboot, you should try running the nvidia-smi command; hopefully, you will see all of your GPUs listed.
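To make that check part of the playbook, a small hedged sketch (the expected count of 8 matches my 8 x A40 host; adjust it for yours):

- name: List the GPUs the driver can see
  ansible.builtin.command:
    cmd: nvidia-smi -L
  register: gpu_list
  changed_when: false

- name: Fail early if not all GPUs are visible
  ansible.builtin.assert:
    that:
      - gpu_list.stdout_lines | length == 8   # 8 x A40 in my case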
Traditional methods for performing data reductions are very costly in terms of latency and CPU cycles. The NVIDIA Quantum InfiniBand switch with NVIDIA SHARP technology addresses complex operations such as data reduction in a simplified, efficient way. By reducing data within the switch network, NVIDIA Quantum switches perform the reduction in a fraction of the time of traditional methods.
If you are running a cluster with NVIDIA GPUs, there now exists a Python module for monitoring NVIDIA GPUs using the newly released Python bindings for NVML (NVIDIA Management Library). These bindings are under the BSD license and allow simplified access to GPU metrics like temperature, memory usage, and utilization.
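If you want the bindings on every node as part of the same playbook, a minimal sketch, assuming they are the package published on PyPI as nvidia-ml-py:

- name: Install the NVML Python bindings
  ansible.builtin.pip:
    name: nvidia-ml-py   # assumed PyPI name for the official NVML bindings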