February 11, 2024 by kittycool only

Nvidia-smi slow startup fix

If you encounter slow nvidia-smi before the information is shown. For my 8 x A40 Cards, it took about 26 seconds to initialise.

The reason for slow initialization might be due to the driver persistence issue. For more background on the issue, do take a look at Nvidia Driver Persistence. According to the article,

The NVIDIA GPU driver has historically followed Unix design philosophies by only initializing software and hardware state when the user has configured the system to do so. Traditionally, this configuration was done via the X Server and the GPUs were only initialized when the X Server (on behalf of the user) requested that they be enabled. This is very important for the ability to reconfigure the GPUs without a reboot (for example, changing SLI mode or bus settings, especially in the AGP days).

More recently, this has proven to be a problem within compute-only environments, where X is not used and the GPUs are accessed via transient instantiations of the Cuda library. This results in the GPU state being initialized and deinitialized more often than the user truly wants and leads to long load times for each Cuda job, on the order of seconds.

NVIDIA previously provided Persistence Mode to solve this issue. This is a kernel-level solution that can be configured using nvidia-smi. This approach would prevent the kernel module from fully unloading software and hardware state when no user software was using the GPU. However, this approach creates subtle interaction problems with the rest of the system that have made maintenance difficult.

The purpose of the NVIDIA Persistence Daemon is to replace this kernel-level solution with a more robust user-space solution. This enables compute-only environments to more closely resemble the historically typical graphics environments that the NVIDIA GPU driver was designed around.
Nvidia Driver Persistence

The Solution is very easy. Just start and enable nvidia-persistenced

# systemctl enable nvidia-persistenced
# systemctl start nvidia-persistenced

Immediately, the nvidia-smi command becomes more responsive

October 30, 2023 by kittycool only

Enabling Nvidia Tesla 4 x A100 with NVLink for MPI

I was having issues with the Applications like NetKET to detect and enable MPI.

Diagnosis

I have installed OpenMPI and enabled CUDA during the configuration.
CUDA Libraries including nvidia-smi has been installed without issue. But running, nvidia-smi topo –matrix, I am not able to see NVLink similar to

In fact, when I run NetKet on CUDA with MPI, the error that was generated was

mpirun noticed that process rank 0 with PID 0 on node gpu1 exited on signal 11 (Segmentation fault)."

Solution

This forum entry provided some enlightenment. https://forums.developer.nvidia.com/t/cuda-initialization-error-on-8x-a100-gpu-hgx-server/250936

The solution was to disable the Multi-instance GPU Mode which is enabled by default. Reboot the Server and it should see

nvidia-smi -mig 0

Enabling Persistence Mode

Make sure the configuration stays after a reboot.

# systemctl enable nvidia-persistenced.service
# systemctl start nvidia-persistenced.service

October 18, 2023 by kittycool only

Basic use of nvidia-smi commands

There is a very good article written by Microway on this utility. Take a look at nvidia-smi: Control Your GPUs

What is nvidia-smi?

nvidia-smi is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

Installation

Do take a look at NVIDIA CUDA Installation Guide for Linux for more information

Query GPU Status

$ nvidia-smi -L

Query overall GPU usage with 1-second update intervals

$ nvidia-smi dmon

Query System/GPU Topology and NVLink

$ nvidia-smi topo --matrix

$ nvidia-smi nvlink --status

Query Details of GPU Cards

$ nvidia-smi -i 0 -q

October 17, 2023 by kittycool only

Basic Use of Nvidia Data Centre GPU Manager (DCGM)

For more information, take a look at The NVIDIA® Data Center GPU Manager (DCGM) . According to the Information,

The NVIDIA® Data Center GPU Manager (DCGM) simplifies administration of NVIDIA Datacenter (previously “Tesla”) GPUs in cluster and datacenter environments. At its heart, DCGM is an intelligent, lightweight user space library/agent that performs a variety of functions on each host system:

GPU behavior monitoring
GPU configuration management
GPU policy oversight
GPU health and diagnostics
GPU accounting and process statistics
NVSwitch configuration and monitoring

This functionality is accessible programmatically though public APIs and interactively through CLI tools. It is designed to be run either as a standalone entity or as an embedded library within management tools. This document is intended as an overview of DCGM’s main goals and features and is intended for system administrators, ISV developers, and individual users managing groups of NVIDIA GPUs.

Installation

Assuming you are using RHEL Derivative like Rocky Linux 8, installation is a breeze

# dnf config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-rhel8.repo

# dnf install -y datacenter-gpu-manager

Enable the DCGM systemd service (on reboot) and start it now

# systemctl --now enable nvidia-dcgm
# systemctl start nvidia-dcgm

Basic Usage – Discovery

#  dcgmi discovery -l

Basic Usage – Diagnostic

To run a diagnostic test, you can use the command. You can decide on the level of diagostic. For example,

# dcgmi diag -r 2

If you want a more comprehensive diagnostic, you can use the command, you can use -r 3

# dcgmi diag -r 3

Basic Usage – NVLink Status

# dcgmi nvlink -s

August 2, 2023 by kittycool only

Installing CUDA with Ansible for Rocky Linux 8

Installation Guide

You can take a look at Nvidia CUDA Installation Guide for more information

Step 1: Get the Nvidia CUDA Repo

You can find the Repo from the Nvidia Download Sites. It should be named cuda_rhel8.repo. Copy it and use it as a template with a j2 extension.

[cuda-rhel8-x86_64]
name=cuda-rhel8-x86_64
baseurl=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64
enabled=1
gpgcheck=1
gpgkey=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/D42D0685.pub

Step 2: Use Ansible to Generate the repo from Templates.

The Ansible Script should look like this.

 - name: Generate /etc/yum.repos.d/cuda_rhel8.repo
   template:
    src: ../templates/cuda-rhel8-repo.j2
    dest: /etc/yum.repos.d/cuda_rhel8.repo
    owner: root
    group: root
    mode: 0644
   become: true
   when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"

Step 3: Install the Kernel-Headers and Kernel-Devel

The CUDA Driver requires that the kernel headers and development packages for the running version of the kernel be installed at the time of the driver installation, as well as whenever the driver is rebuilt.

- name: Install Kernel-Headers and  Kernel-Devel
  dnf:
    name:
        - kernel-devel
        - kernel-headers
    state: present
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"

Step 4: Disabling Nouveau

To install the Display Driver, the Nouveau drivers must first be disabled. I use a template to disable it. I created a template called blacklist-nouveau-conf.j2. Here is the content

blacklist nouveau
options nouveau modeset=0

The Ansible script for disabling Noveau using a template

- name: Generate blacklist nouveau
  template:
    src: ../templates/blacklist-nouveau-conf.j2
    dest: /etc/modprobe.d/blacklist-nouveau.conf
    owner: root
    group: root
    mode: 0644
  become: true
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"

Step 5: Install the Drivers and CUDA

- name: Install driver packages RHEL 8 and newer
  dnf:
    name: '@nvidia-driver:latest-dkms'
    state: present
    update_cache: yes
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
  register: install_driver

- name: Install CUDA
  dnf:
    name: cuda
    state: present
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
  register: install_cuda

Step 6: Reboot if there are changes to Drivers and CUDA

- name: Reboot if there are changes to Drivers or CUDA
  ansible.builtin.reboot:
  when:
    - install_driver.changed or install_cuda.changed
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"

Aftermath

After reboot, you should try to do “nvidia-smi” commands, hopefully, you should see

If you have an error “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver for RHEL 8“, do follow the steps in NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver for RHEL 8 and run the ansible script in the blog.

You may also combine all these yml into one large yml file

Other better? Ansible Scripts

You may want to consider other better? options for https://github.com/NVIDIA/ansible-role-nvidia-docker

July 24, 2023 by kittycool only

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver for RHEL 8

If you have installed the CUDA Drivers and CUDA SDK using the NVIDIA CUDA Installation Guide for Linux. Look for Section 3.3.3 for RHEL 8 / Rocky 9

If after following instruction, you are still facing issues, you may want to consider the following

1- Blacklist nouveau.conf

$ vim /etc/modprobe.d/blacklist-nouveau.conf

blacklist nouveau
options nouveau modeset=0

2- Remove Nvidia driver installation

# dnf module remove --all nvidia-driver

3- Remove CUDA-Related Installation

sudo dnf remove "cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
 "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*"

4- Reboot

# shutdown -r now

References:

Forum – CentOS Stream 8: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver

January 9, 2023 by kittycool only

In-Network Computing with NVIDIA SHARP

Traditional methods for performing data reductions are very costly in terms of latency and CPU cycles. The NVIDIA Quantum InfiniBand switch with NVIDIA SHARP technology addresses complex operations such as data reduction in a simplified, efficient way. By reducing data within the switch network, NVIDIA Quantum switches perform the reduction in a fraction of the time of traditional methods.

October 9, 2022 by kittycool only

Ganglia and Gmond Python module for GPUs

If you are running a cluster with NVIDIA GPUs, there now exists a python module for monitoring NVIDIA GPUs using the newly released Python bindings for NVML (NVIDIA Management Library). These bindings are under BSD license and allow simplified access to GPU metrics like temperature, memory usage, and utilization.
Nvidia Developer – Ganglia Monitoring System

To install the Ganglia plug-in on your Ganglia installation, see these download links:

For more information see:

Acknowledgements:

Bernard Li (Lawrence Berkeley National Laboratory)
Jeremy Enos (National Center for Supercomputing Applications)

September 21, 2022 by kittycool only

Basic Commands for Mellanox Network Switches for Break-out-Ports

More information can be found at Command Line Interface (CLI)

Point 1: To configure Break-Out

> enable
# configure terminal
# interface ethernet ?
R2-R8-LEAF01 [standalone: master] (config) # interface ethernet ?
<Device/Port>[-<Device/Port>]
1/1/1
1/1/2
1/1/3
1/1/4
1/3/1
1/3/2
1/3/3
1/3/4
1/5/1
1/5/2
1/5/3
1/5/4
1/7/1
1/7/2
1/7/3
1/7/4
1/9/1
1/9/2
1/9/3
1/9/4
.....
.....
1/25
1/26
1/27
1/28
1/29
1/30
1/31
1/32
# interface ethernet 1/25 shutdown
# interface ethernet 1/26 shutdown
# interface ethernet 1/25
# (config interface ethernet 1/25) # module-type qsfp-split-4 force

The resulting interface will become

Ethernet 1/25/1
Ethernet 1/25/2
Ethernet 1/25/3
Ethernet 1/25/4

Speed configuration can be found at

interface ethernet 1/25/1
# speed 25G

June 7, 2022 by kittycool only

Cannot install the best candidate for the job for CUDA Drivers and Rocky Linux 8.5

I follow the blog Installing Nvidia Drivers on Rocky Linux 8.5. But I encountered an error that I have not encountered before

Error:
 Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64

The hint is that dkms is required.

nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64

Enable EPEL Repository

# dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm

 # dnf config-manager --enable epel

Install dkms

 # dnf install dkms*

Install the latest Nvidia Drivers (If possible).

# dnf module install nvidia-driver:latest

If the Error pop out like this

Last metadata expiration check: 0:01:01 ago on Mon 06 Jun 2022 08:47:40 PM EDT.
Error:
 Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering

You will notice that the dkms issues has been resolved. Try not using the nvidia-driver:latest

# dnf module install nvidia-driver

===================================================================================================================================================
 Package                                Architecture        Version                                           Repository                      Size
===================================================================================================================================================
Upgrading:
 bcc                                    x86_64              0.19.0-5.el8                                      appstream                      674 k
 bcc-tools                              x86_64              0.19.0-5.el8                                      appstream                      447 k
 bpftrace                               x86_64              0.12.1-4.el8                                      appstream                      1.3 M
 clang-libs                             x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       23 M
 clang-resource-filesystem              x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       13 k
 compiler-rt                            x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                      4.2 M
 libglvnd                               x86_64              1:1.3.4-1.el8                                     appstream                      126 k
 libglvnd-egl                           x86_64              1:1.3.4-1.el8                                     appstream                       48 k
 libglvnd-gles                          x86_64              1:1.3.4-1.el8                                     appstream                       39 k
 libglvnd-glx                           x86_64              1:1.3.4-1.el8                                     appstream                      136 k
 libomp-devel                           x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       28 k
 llvm-libs                              x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       24 M
 mesa-dri-drivers                       x86_64              21.3.4-1.el8                                      appstream                       11 M
 mesa-filesystem                        x86_64              21.3.4-1.el8                                      appstream                       33 k
 mesa-libxatracker                      x86_64              21.3.4-1.el8                                      appstream                      2.0 M
 python3-bcc                            x86_64              0.19.0-5.el8                                      appstream                       89 k
Installing group/module packages:
 cuda-drivers                           x86_64              515.48.07-1                                       cuda-rhel8-x86_64              8.1 k
 kmod-nvidia-latest-dkms                x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               30 M
 nvidia-driver                          x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               23 M
 nvidia-driver-NVML                     x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              462 k
 nvidia-driver-NvFBCOpenGL              x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               54 k
 nvidia-driver-cuda                     x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              455 k
 nvidia-driver-cuda-libs                x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               54 M
 nvidia-driver-devel                    x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               13 k
 nvidia-driver-libs                     x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              177 M
 nvidia-kmod-common                     noarch              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               13 k
 nvidia-libXNVCtrl                      x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               26 k
 nvidia-libXNVCtrl-devel                x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               56 k
 nvidia-modprobe                        x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               37 k
 nvidia-persistenced                    x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               43 k
 nvidia-settings                        x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              835 k
 nvidia-xconfig                         x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              106 k
Installing dependencies:
 dnf-plugin-nvidia                      noarch              2.0-1.el8                                         cuda-rhel8-x86_64               12 k
 egl-wayland                            x86_64              1.1.9-3.el8                                       appstream                       39 k
 libX11-devel                           x86_64              1.6.8-5.el8                                       appstream                      975 k
 libXau-devel                           x86_64              1.0.9-3.el8                                       appstream                       19 k
 libglvnd-opengl                        x86_64              1:1.3.4-1.el8                                     appstream                       46 k
 libvdpau                               x86_64              1.4-2.el8                                         appstream                       40 k
 libxcb-devel                           x86_64              1.13.1-1.el8                                      appstream                      1.1 M
 mesa-vulkan-drivers                    x86_64              21.3.4-1.el8                                      appstream                      6.7 M
 ocl-icd                                x86_64              2.2.12-1.el8                                      appstream                       50 k
 opencl-filesystem                      noarch              1.0-6.el8                                         appstream                      7.3 k
 vulkan-loader                          x86_64              1.3.204.0-2.el8                                   appstream                      133 k
 xorg-x11-proto-devel                   noarch              2020.1-3.el8                                      appstream                      279 k
Installing module profiles:
 nvidia-driver/default
Enabling module streams:
 nvidia-driver                                              latest-dkms

.....
.....

Finally do a

# nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:A3:00.0 Off |                    0 |
| N/A   49C    P0    46W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:C3:00.0 Off |                    0 |
| N/A   53C    P0    46W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The Linux Cluster

Linux Cluster Blog is a collection of how-to and tutorials for Linux Cluster and Enterprise Linux

Nvidia