Nvidia-smi slow startup fix

If nvidia-smi takes a long time before it shows any information, the fix below may help. On my server with 8 x A40 cards, it took about 26 seconds to initialise.

The slow initialization is likely caused by the driver persistence issue. For more background, take a look at Nvidia Driver Persistence. According to the article,

The NVIDIA GPU driver has historically followed Unix design philosophies by only initializing software and hardware state when the user has configured the system to do so. Traditionally, this configuration was done via the X Server and the GPUs were only initialized when the X Server (on behalf of the user) requested that they be enabled. This is very important for the ability to reconfigure the GPUs without a reboot (for example, changing SLI mode or bus settings, especially in the AGP days).

More recently, this has proven to be a problem within compute-only environments, where X is not used and the GPUs are accessed via transient instantiations of the Cuda library. This results in the GPU state being initialized and deinitialized more often than the user truly wants and leads to long load times for each Cuda job, on the order of seconds.

NVIDIA previously provided Persistence Mode to solve this issue. This is a kernel-level solution that can be configured using nvidia-smi. This approach would prevent the kernel module from fully unloading software and hardware state when no user software was using the GPU. However, this approach creates subtle interaction problems with the rest of the system that have made maintenance difficult.

The purpose of the NVIDIA Persistence Daemon is to replace this kernel-level solution with a more robust user-space solution. This enables compute-only environments to more closely resemble the historically typical graphics environments that the NVIDIA GPU driver was designed around.

Nvidia Driver Persistence

The solution is simple: enable and start nvidia-persistenced.

# systemctl enable nvidia-persistenced
# systemctl start nvidia-persistenced

Immediately, the nvidia-smi command becomes more responsive.
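To confirm that persistence is actually active on every card, the per-GPU persistence mode can be queried in CSV form. The following is a minimal sketch, assuming your nvidia-smi supports the persistence_mode query field; count_not_persistent is a hypothetical helper name, not part of any NVIDIA tool.

```shell
# Count GPUs whose persistence mode is not "Enabled".
# Reads CSV lines such as "0, NVIDIA A40, Enabled" on stdin.
count_not_persistent() {
    awk -F', *' '$3 != "Enabled" { n++ } END { print n + 0 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,persistence_mode --format=csv,noheader |
        count_not_persistent
fi
```

A result of 0 suggests that all GPUs are being kept initialised between jobs.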

Basic use of nvidia-smi commands

There is a very good article written by Microway on this utility. Take a look at nvidia-smi: Control Your GPUs

What is nvidia-smi?

nvidia-smi is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.


Do take a look at NVIDIA CUDA Installation Guide for Linux for more information

List the GPUs in the System

$ nvidia-smi -L

Query overall GPU usage with 1-second update intervals

$ nvidia-smi dmon

Query System/GPU Topology and NVLink

$ nvidia-smi topo --matrix
$ nvidia-smi nvlink --status

Query Details of GPU Cards

$ nvidia-smi -i 0 -q
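For scripting, the CSV query interface is often easier to parse than the human-readable output of the commands above. The sketch below assumes the index, utilization.gpu, memory.used and memory.total query fields; busy_gpus and the 80 percent threshold are hypothetical choices for illustration.

```shell
# Print the index of every GPU whose utilization exceeds a threshold (percent).
# Reads "index, utilization, memory.used, memory.total" CSV lines on stdin.
busy_gpus() {
    threshold=${1:-50}
    awk -F', *' -v t="$threshold" '$2 + 0 > t + 0 { print $1 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
        --format=csv,noheader,nounits | busy_gpus 80
fi
```

The nounits option strips the "%" and "MiB" suffixes, so the values can be compared numerically.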

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver for RHEL 8

If you installed the CUDA Drivers and CUDA SDK by following the NVIDIA CUDA Installation Guide for Linux, look for Section 3.3.3 for RHEL 8 / Rocky 9.

If you are still facing issues after following the instructions, you may want to consider the following:

1- Blacklist nouveau

# vim /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

2- Remove Nvidia driver installation

# dnf module remove --all nvidia-driver

3- Remove CUDA-Related Installation

sudo dnf remove "cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
 "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*"

4- Reboot

# shutdown -r now
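After the reboot, it can help to check which GPU kernel module is actually loaded before reinstalling the driver. The following is a minimal sketch; driver_state is a hypothetical helper that only inspects lsmod-style output.

```shell
# Classify the driver state from lsmod-style output on stdin.
driver_state() {
    awk '$1 == "nouveau" { nouveau = 1 }
         $1 == "nvidia"  { nvidia = 1 }
         END {
             if (nouveau)     print "nouveau still loaded"
             else if (nvidia) print "nvidia loaded"
             else             print "no GPU kernel module loaded"
         }'
}

# Inspect the running system when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    lsmod | driver_state
fi
```

If nouveau is still reported, the blacklist from step 1 has probably not taken effect yet.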


  1. Forum – CentOS Stream 8: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver

Encountering shm_open permission denied issues with hpcx

If you are using NVIDIA HPC-X and encounter an error like the one below during your MPI run:

shm_open(file_name=/ucx_shm_posix_77de2cf3 flags=0xc2) failed: Permission denied

The error message indicates that the process has no permission to use shared memory. The permission of /dev/shm was found to be 755 instead of 777, which caused the error. The issue is resolved once the permission is changed to 777. To change it and verify the change:

% chmod 777 /dev/shm 
% ls -ld /dev/shm
drwxrwxrwx 2 root root 40 Jul  6 15:18 /dev/shm
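The same check can be scripted, for example in a job prologue. This is a minimal sketch, assuming GNU stat (the -c '%a' format); is_world_writable is a hypothetical helper. Many distributions ship /dev/shm as 1777 (world-writable with the sticky bit), which this check also accepts.

```shell
# Succeed when an octal mode string (e.g. 777 or 1777) grants write to "others".
is_world_writable() {
    m=$1
    others=${m#"${m%?}"}    # keep only the last octal digit
    [ $(( others & 2 )) -ne 0 ]
}

# Report the current mode of /dev/shm when stat can read it.
mode=$(stat -c '%a' /dev/shm 2>/dev/null) || mode=
if [ -n "$mode" ]; then
    if is_world_writable "$mode"; then
        echo "/dev/shm mode $mode: world-writable"
    else
        echo "/dev/shm mode $mode: not world-writable"
    fi
fi
```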

Cannot install the best candidate for the job for CUDA Drivers and Rocky Linux 8.5

I followed the blog post Installing Nvidia Drivers on Rocky Linux 8.5, but I hit an error that I had not encountered before:

 Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
 Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64

The hint is that dkms is required.

nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
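With a log this long, the missing dependency is easier to spot when it is extracted mechanically. The following is a small sketch; missing_deps is a hypothetical helper that only parses the "nothing provides" lines of dnf's error text.

```shell
# Print each package name that dnf reports as "nothing provides", once.
missing_deps() {
    sed -n 's/.*nothing provides \([^ ]*\) needed by.*/\1/p' | sort -u
}

# Example using three lines taken from the error above.
missing_deps <<'EOF'
  - cannot install the best candidate for the job
  - nothing provides dkms needed by kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64
  - nothing provides dkms needed by kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64
EOF
```

For the sample lines this prints dkms, the only dependency that no enabled repository provides.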

Enable EPEL Repository

# dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# dnf config-manager --enable epel

Install dkms

# dnf install 'dkms*'

Install the latest Nvidia drivers (if possible).

# dnf module install nvidia-driver:latest

If the error looks like this:

Last metadata expiration check: 0:01:01 ago on Mon 06 Jun 2022 08:47:40 PM EDT.
 Problem 1: package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 2: package cuda-drivers-515.48.07-1.x86_64 requires nvidia-kmod >= 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 3: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 4: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-modprobe-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 5: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-settings-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
 Problem 6: package nvidia-driver-3:515.48.07-1.el8.x86_64 requires nvidia-kmod-common = 3:515.48.07, but none of the providers can be installed
  - package nvidia-xconfig-3:515.48.07-1.el8.x86_64 requires nvidia-driver(x86-64) = 3:515.48.07, but none of the providers can be installed
  - package nvidia-kmod-common-3:515.48.07-1.el8.noarch requires nvidia-kmod = 3:515.48.07, but none of the providers can be installed
  - cannot install the best candidate for the job
  - package kmod-nvidia-latest-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering
  - package kmod-nvidia-open-dkms-3:515.48.07-1.el8.x86_64 is filtered out by modular filtering

You will notice that the dkms issue has been resolved; only the modular filtering messages remain. Try installing the module without specifying the latest stream:

# dnf module install nvidia-driver
 Package                                Architecture        Version                                           Repository                      Size
 bcc                                    x86_64              0.19.0-5.el8                                      appstream                      674 k
 bcc-tools                              x86_64              0.19.0-5.el8                                      appstream                      447 k
 bpftrace                               x86_64              0.12.1-4.el8                                      appstream                      1.3 M
 clang-libs                             x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       23 M
 clang-resource-filesystem              x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       13 k
 compiler-rt                            x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                      4.2 M
 libglvnd                               x86_64              1:1.3.4-1.el8                                     appstream                      126 k
 libglvnd-egl                           x86_64              1:1.3.4-1.el8                                     appstream                       48 k
 libglvnd-gles                          x86_64              1:1.3.4-1.el8                                     appstream                       39 k
 libglvnd-glx                           x86_64              1:1.3.4-1.el8                                     appstream                      136 k
 libomp-devel                           x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       28 k
 llvm-libs                              x86_64              13.0.1-1.module+el8.6.0+825+7e27476a              appstream                       24 M
 mesa-dri-drivers                       x86_64              21.3.4-1.el8                                      appstream                       11 M
 mesa-filesystem                        x86_64              21.3.4-1.el8                                      appstream                       33 k
 mesa-libxatracker                      x86_64              21.3.4-1.el8                                      appstream                      2.0 M
 python3-bcc                            x86_64              0.19.0-5.el8                                      appstream                       89 k
Installing group/module packages:
 cuda-drivers                           x86_64              515.48.07-1                                       cuda-rhel8-x86_64              8.1 k
 kmod-nvidia-latest-dkms                x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               30 M
 nvidia-driver                          x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               23 M
 nvidia-driver-NVML                     x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              462 k
 nvidia-driver-NvFBCOpenGL              x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               54 k
 nvidia-driver-cuda                     x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              455 k
 nvidia-driver-cuda-libs                x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               54 M
 nvidia-driver-devel                    x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               13 k
 nvidia-driver-libs                     x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              177 M
 nvidia-kmod-common                     noarch              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               13 k
 nvidia-libXNVCtrl                      x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               26 k
 nvidia-libXNVCtrl-devel                x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               56 k
 nvidia-modprobe                        x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               37 k
 nvidia-persistenced                    x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64               43 k
 nvidia-settings                        x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              835 k
 nvidia-xconfig                         x86_64              3:515.48.07-1.el8                                 cuda-rhel8-x86_64              106 k
Installing dependencies:
 dnf-plugin-nvidia                      noarch              2.0-1.el8                                         cuda-rhel8-x86_64               12 k
 egl-wayland                            x86_64              1.1.9-3.el8                                       appstream                       39 k
 libX11-devel                           x86_64              1.6.8-5.el8                                       appstream                      975 k
 libXau-devel                           x86_64              1.0.9-3.el8                                       appstream                       19 k
 libglvnd-opengl                        x86_64              1:1.3.4-1.el8                                     appstream                       46 k
 libvdpau                               x86_64              1.4-2.el8                                         appstream                       40 k
 libxcb-devel                           x86_64              1.13.1-1.el8                                      appstream                      1.1 M
 mesa-vulkan-drivers                    x86_64              21.3.4-1.el8                                      appstream                      6.7 M
 ocl-icd                                x86_64              2.2.12-1.el8                                      appstream                       50 k
 opencl-filesystem                      noarch              1.0-6.el8                                         appstream                      7.3 k
 vulkan-loader                          x86_64                                       appstream                      133 k
 xorg-x11-proto-devel                   noarch              2020.1-3.el8                                      appstream                      279 k
Installing module profiles:
Enabling module streams:
 nvidia-driver                                              latest-dkms
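To double-check which module stream ended up enabled, the output of dnf module list can be filtered. This is a sketch under an assumption about the output layout (module name in the first column, stream in the second, and an [e] marker on enabled streams); enabled_stream is a hypothetical helper.

```shell
# Print the enabled stream of a module from `dnf module list` output on stdin.
# Assumes lines like: "nvidia-driver  latest-dkms [d][e]  default [d], fm, ks  ..."
enabled_stream() {
    module=$1
    awk -v m="$module" '$1 == m && /\[e\]/ { print $2 }'
}

# Usage on a system with dnf (not executed here):
#   dnf -q module list --enabled nvidia-driver | enabled_stream nvidia-driver
```

If an unwanted stream is enabled, dnf module reset nvidia-driver clears the choice so a different stream can be selected.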


Finally, run nvidia-smi to confirm that the driver is working:

# nvidia-smi
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-PCI...  Off  | 00000000:A3:00.0 Off |                    0 |
| N/A   49C    P0    46W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
|   1  NVIDIA A100-PCI...  Off  | 00000000:C3:00.0 Off |                    0 |
| N/A   53C    P0    46W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

Installing Nvidia Drivers on Rocky Linux 8.5

If you are planning to install Nvidia drivers on Rocky Linux 8.5, you may want to install them with DNF. For a detailed explanation, see Streamlining NVIDIA Driver Deployment on RHEL 8 with Modularity Streams.

Step 1: Add the official Nvidia repository to the package manager's repository list.

# dnf config-manager --add-repo=https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo

Step 2: Install Kernel-Devel and Headers used by the Drivers

# dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

Step 3: Installing Nvidia Drivers and Settings

# dnf install nvidia-driver nvidia-settings

Step 4: Install CUDA Drivers and Reboot

# dnf install cuda-drivers

Once done, reboot:

# reboot
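A mismatch between the running kernel and the installed kernel-devel package (Step 2) is one thing worth ruling out before and after the reboot. The following is a minimal sketch; has_matching_devel is a hypothetical helper, and it compares plain package name strings rather than calling into rpm.

```shell
# Succeed if one of the given package names is kernel-devel for the given kernel.
# Usage: has_matching_devel <running-kernel> <package-name>...
has_matching_devel() {
    running=$1
    shift
    for pkg in "$@"; do
        [ "$pkg" = "kernel-devel-$running" ] && return 0
    done
    return 1
}

if command -v rpm >/dev/null 2>&1; then
    # Word splitting of the rpm output is intentional: one package name per word.
    if has_matching_devel "$(uname -r)" $(rpm -q kernel-devel 2>/dev/null); then
        echo "kernel-devel matches running kernel $(uname -r)"
    else
        echo "no kernel-devel package found for running kernel $(uname -r)"
    fi
fi
```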

If, after the reboot, you run nvidia-smi and receive an error like the one below:

# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

You may want to take a look at https://gist.github.com/espoirMur/65cec3d67e0a96e270860c9c276ab9fa. The cause could be the Secure Boot option in your BIOS.
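The Secure Boot state can usually be read from the running system without entering the BIOS setup. This sketch assumes the mokutil tool is installed and that mokutil --sb-state prints a line containing "SecureBoot enabled" or "SecureBoot disabled"; secure_boot_enabled is a hypothetical helper.

```shell
# Interpret the state line printed by `mokutil --sb-state`.
secure_boot_enabled() {
    case $1 in
        *"SecureBoot enabled"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the firmware state only when mokutil is installed.
if command -v mokutil >/dev/null 2>&1; then
    state=$(mokutil --sb-state 2>/dev/null) || state=
    if secure_boot_enabled "$state"; then
        echo "Secure Boot is enabled; unsigned nvidia modules may be refused"
    else
        echo "Secure Boot does not appear to be enabled"
    fi
fi
```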

Install Nvidia Drivers on CentOS 7

Getting Information on Nvidia GPU on CentOS 7

# lspci | grep -i --color 'vga\|3d\|2d'
02:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200e [Pilot] ServerEngines (SEP1) (rev 42)
86:00.0 VGA compatible controller: NVIDIA Corporation GP102GL [Quadro P6000] (rev a1)
# lshw -class display
       description: VGA compatible controller
       product: MGA G200e [Pilot] ServerEngines (SEP1)
       vendor: Matrox Electronics Systems Ltd.
       physical id: 0
       bus info: pci@0000:02:00.0
       version: 42
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi vga_controller bus_master cap_list rom
       configuration: driver=mgag200 latency=0
       resources: irq:16 memory:d3000000-d3ffffff memory:d4a10000-d4a13fff memory:d4000000-d47fffff memory:d4a00000-d4a0ffff
       description: VGA compatible controller
       product: GP102GL [Quadro P6000]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:86:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: iomemory:3df0-3def iomemory:3df0-3def irq:320 memory:ec000000-ecffffff memory:3dfe0000000-3dfefffffff memory:3dff0000000-3dff1ffffff ioport:c000(size=128) memory:ed000000-ed07ffff
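On a server with both an onboard controller and a GPU, as above, it can be handy to pull out just the PCI addresses of the NVIDIA devices. The following is a small sketch; nvidia_devices is a hypothetical helper that filters lspci-style lines.

```shell
# Print the PCI address (first field) of every line that mentions NVIDIA.
nvidia_devices() {
    awk 'tolower($0) ~ /nvidia/ { print $1 }'
}

# Scan the running system when lspci is available.
if command -v lspci >/dev/null 2>&1; then
    lspci | nvidia_devices
fi
```

For the two lspci lines shown above, this prints only 86:00.0, the address of the Quadro P6000.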

Nvidia Downloads Site

Based on this information, download the matching driver from the Nvidia Download Page.

Yum Install Libraries and Dependencies

# yum group install "Development Tools"
# yum install kernel-devel
# yum install epel-release
# yum install dkms

Disable Nouveau Drivers

Disable the nouveau driver by editing the /etc/default/grub file. Add nouveau.modeset=0 to the line starting with GRUB_CMDLINE_LINUX. This will disable the nouveau driver after the reboot.

GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet nouveau.modeset=0"
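The edit can also be scripted so that it is safe to run more than once. This is a sketch that assumes the GRUB_CMDLINE_LINUX value is a single double-quoted line, as in the example above; add_nouveau_modeset is a hypothetical helper that filters stdin to stdout and does not touch any file itself.

```shell
# Append nouveau.modeset=0 to GRUB_CMDLINE_LINUX unless it is already present.
add_nouveau_modeset() {
    sed '/^GRUB_CMDLINE_LINUX=/ {
        /nouveau\.modeset=0/! s/"$/ nouveau.modeset=0"/
    }'
}

# Usage (review the result before replacing /etc/default/grub):
#   add_nouveau_modeset < /etc/default/grub > /tmp/grub.new
```

Writing to a temporary file first makes it possible to compare the result against the original before overwriting it.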

Modifying the Grub.cfg

For BIOS users:

# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-957.5.1.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-957.5.1.el7.x86_64.img
Found linux image: /boot/vmlinuz-3.10.0-957.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-957.el7.x86_64.img
Found linux image: /boot/vmlinuz-0-rescue-86f557f292e5492aa7ac0bf1cb2670b0
Found initrd image: /boot/initramfs-0-rescue-86f557f292e5492aa7ac0bf1cb2670b0.img

For UEFI users (on CentOS the EFI directory is centos; on RHEL it is redhat):

# grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
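If you are unsure whether the machine booted in BIOS or UEFI mode, the presence of /sys/firmware/efi is a common indicator. The sketch below assumes a CentOS layout, where the EFI directory is named centos (on RHEL it is redhat); grub_cfg_path is a hypothetical helper, and its optional argument exists only so the check can be exercised against another directory.

```shell
# Print the grub.cfg path to regenerate, based on whether the EFI sysfs directory exists.
grub_cfg_path() {
    efi_dir=${1:-/sys/firmware/efi}
    if [ -d "$efi_dir" ]; then
        echo /boot/efi/EFI/centos/grub.cfg    # assumption: CentOS EFI directory name
    else
        echo /boot/grub2/grub.cfg
    fi
}

echo "grub.cfg target: $(grub_cfg_path)"
```

The printed path can then be passed to grub2-mkconfig -o.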

Switch CentOS from GUI to Text Mode

First switch to Text Mode

# systemctl isolate multi-user.target

Installing the Nvidia Driver on CentOS 7

# bash NVIDIA-Linux-x86_64-*

Reboot the System

# reboot

Finally, run the command nvidia-settings to check and configure the driver.

# nvidia-settings
