Nvidia-smi slow startup fix

You may find that nvidia-smi takes a long time before the information is shown. On my server with 8 x A40 cards, it took about 26 seconds to initialise.

The slow initialisation is likely due to a driver persistence issue. For more background, take a look at NVIDIA Driver Persistence. According to the article,

The NVIDIA GPU driver has historically followed Unix design philosophies by only initializing software and hardware state when the user has configured the system to do so. Traditionally, this configuration was done via the X Server and the GPUs were only initialized when the X Server (on behalf of the user) requested that they be enabled. This is very important for the ability to reconfigure the GPUs without a reboot (for example, changing SLI mode or bus settings, especially in the AGP days).

More recently, this has proven to be a problem within compute-only environments, where X is not used and the GPUs are accessed via transient instantiations of the CUDA library. This results in the GPU state being initialized and deinitialized more often than the user truly wants and leads to long load times for each CUDA job, on the order of seconds.

NVIDIA previously provided Persistence Mode to solve this issue. This is a kernel-level solution that can be configured using nvidia-smi. This approach would prevent the kernel module from fully unloading software and hardware state when no user software was using the GPU. However, this approach creates subtle interaction problems with the rest of the system that have made maintenance difficult.

The purpose of the NVIDIA Persistence Daemon is to replace this kernel-level solution with a more robust user-space solution. This enables compute-only environments to more closely resemble the historically typical graphics environments that the NVIDIA GPU driver was designed around.

Nvidia Driver Persistence

The solution is very easy: just enable and start nvidia-persistenced.

# systemctl enable nvidia-persistenced
# systemctl start nvidia-persistenced

The nvidia-smi command immediately becomes more responsive.
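To confirm that persistence mode is now active, you can query it for every GPU (persistence_mode is one of the standard --query-gpu fields):

$ nvidia-smi --query-gpu=index,persistence_mode --format=csv

Each GPU should report Enabled once the daemon is running.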

Enabling Nvidia Tesla 4 x A100 with NVLink for MPI

I was having issues getting applications like NetKet to detect and enable MPI.

Diagnosis

  1. I have installed OpenMPI with CUDA enabled during the configuration.
  2. The CUDA libraries, including nvidia-smi, were installed without issue. But when running nvidia-smi topo --matrix, I was not able to see the expected NVLink connections in the topology matrix (see the diagnostic commands below).
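As a quick set of diagnostics (assuming OpenMPI's ompi_info is on the PATH), you can confirm whether OpenMPI was really built with CUDA support and check what the driver reports for the GPU interconnect:

$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
$ nvidia-smi topo --matrix
$ nvidia-smi nvlink --status

In the topology matrix, NVLink-connected GPU pairs show up as NV1, NV2 and so on, while PCIe-only paths show up as PIX, PXB, PHB or SYS.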

In fact, when I ran NetKet on CUDA with MPI, the error that was generated was:

mpirun noticed that process rank 0 with PID 0 on node gpu1 exited on signal 11 (Segmentation fault).

Solution

This forum entry provided some enlightenment: https://forums.developer.nvidia.com/t/cuda-initialization-error-on-8x-a100-gpu-hgx-server/250936

The solution was to disable Multi-Instance GPU (MIG) mode, which was enabled by default, and then reboot the server:

# nvidia-smi -mig 0
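To check the MIG state before and after the change, mig.mode.current and mig.mode.pending can be queried directly (both are standard --query-gpu fields):

$ nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv

All GPUs should report Disabled after the command above and a reboot.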

Enabling Persistence Mode

Make sure persistence mode stays enabled after a reboot by enabling and starting the service:

# systemctl enable nvidia-persistenced.service
# systemctl start nvidia-persistenced.service

Basic use of nvidia-smi commands

There is a very good article written by Microway on this utility. Take a look at nvidia-smi: Control Your GPUs

What is nvidia-smi?

nvidia-smi is a command line utility, based on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

Installation

Do take a look at NVIDIA CUDA Installation Guide for Linux for more information
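Once the driver is installed, a quick sanity check is to confirm that the kernel module is loaded and that NVML responds:

$ cat /proc/driver/nvidia/version
$ nvidia-smi

The first command prints the loaded driver version; the second should list all GPUs without any error.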

List the GPUs in the System

$ nvidia-smi -L

Query overall GPU usage with 1-second update intervals

$ nvidia-smi dmon
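dmon also accepts options to select metric groups and change the refresh rate. For example, to show power/temperature, utilisation, clocks and memory usage every 2 seconds:

$ nvidia-smi dmon -s pucm -d 2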

Query System/GPU Topology and NVLink

$ nvidia-smi topo --matrix
$ nvidia-smi nvlink --status

Query Details of GPU Cards

$ nvidia-smi -i 0 -q
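For scripting or logging, the query interface gives machine-readable output. For example:

$ nvidia-smi --query-gpu=index,name,temperature.gpu,utilization.gpu,memory.used,memory.total --format=csv

The full list of field names is available via nvidia-smi --help-query-gpu.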

nvidia-smi – Failed to initialize NVML: Insufficient Permissions

The Error Encountered

If you are a non-root user and you issue the command, you might see the error:

% nvidia-smi
Failed to initialize NVML: Insufficient Permissions

The default module option NVreg_DeviceFileMode=0660 is set via /etc/modprobe.d/nvidia-default.conf. This causes the NVIDIA device nodes to be created with 660 permissions:

vim /etc/modprobe.d/nvidia-default.conf
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=1001 NVreg_DeviceFileMode=0660

The Fix

[user1@node1 dev]$ ls -l nvidia*
crw-rw---- 1 root vglusers 195,   0 Jan  5 17:07 nvidia0
crw-rw---- 1 root vglusers 195, 255 Jan  5 17:07 nvidiactl
crw-rw---- 1 root vglusers 195, 254 Jan  5 17:07 nvidia-modeset

The reason for the error is that the device nodes are only accessible to members of the vglusers (or video) group. The fix is simply to add the user to that group:

# usermod -a -G vglusers user1 

After logging off and logging in again, you should be able to run nvidia-smi.
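To verify, check the group membership and list the GPUs again (user1 is the example user from above):

$ id user1
$ nvidia-smi -L

id should now show vglusers among the groups, and nvidia-smi -L should list the GPUs instead of returning the NVML permissions error.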

References:

A fix for the “NVML: Insufficient Permissions”