Nvidia-smi slow startup fix

You may find that nvidia-smi takes a long time before the information is shown. For my 8 x A40 cards, it took about 26 seconds to initialise.

The slow initialisation is likely due to the driver persistence issue. For more background on the issue, take a look at Nvidia Driver Persistence. According to the article,

The NVIDIA GPU driver has historically followed Unix design philosophies by only initializing software and hardware state when the user has configured the system to do so. Traditionally, this configuration was done via the X Server and the GPUs were only initialized when the X Server (on behalf of the user) requested that they be enabled. This is very important for the ability to reconfigure the GPUs without a reboot (for example, changing SLI mode or bus settings, especially in the AGP days).

More recently, this has proven to be a problem within compute-only environments, where X is not used and the GPUs are accessed via transient instantiations of the Cuda library. This results in the GPU state being initialized and deinitialized more often than the user truly wants and leads to long load times for each Cuda job, on the order of seconds.

NVIDIA previously provided Persistence Mode to solve this issue. This is a kernel-level solution that can be configured using nvidia-smi. This approach would prevent the kernel module from fully unloading software and hardware state when no user software was using the GPU. However, this approach creates subtle interaction problems with the rest of the system that have made maintenance difficult.

The purpose of the NVIDIA Persistence Daemon is to replace this kernel-level solution with a more robust user-space solution. This enables compute-only environments to more closely resemble the historically typical graphics environments that the NVIDIA GPU driver was designed around.

Nvidia Driver Persistence

The solution is simple: just enable and start the nvidia-persistenced service.

# systemctl enable nvidia-persistenced
# systemctl start nvidia-persistenced

Immediately, the nvidia-smi command becomes much more responsive.
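To confirm that the daemon is running and that persistence mode is now active on the GPUs, here is a quick check (a minimal sketch; the exact output depends on your driver version):

# systemctl status nvidia-persistenced
# nvidia-smi --query-gpu=persistence_mode --format=csv,noheader

Each GPU should report Enabled.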

Ganglia and Gmond Python module for GPUs

If you are running a cluster with NVIDIA GPUs, there now exists a Python module for monitoring NVIDIA GPUs using the newly released Python bindings for NVML (NVIDIA Management Library). These bindings are under a BSD license and allow simplified access to GPU metrics like temperature, memory usage, and utilization.

Nvidia Developer – Ganglia Monitoring System
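The metrics the gmond module reports are the same ones you can query directly with nvidia-smi, which is a quick way to preview what will show up in Ganglia (a sketch, independent of the Ganglia module itself):

% nvidia-smi --query-gpu=index,name,temperature.gpu,memory.used,memory.total,utilization.gpu --format=csv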

To install the Ganglia plug-in on your Ganglia installation, see the download links on the Nvidia Developer – Ganglia Monitoring System page.
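Deployment follows the usual pattern for gmond Python modules; the file names below are illustrative assumptions, so check the plug-in's README for the actual names and paths:

# cp nvidia.py /usr/lib64/ganglia/python_modules/     # hypothetical module file name
# cp nvidia.pyconf /etc/ganglia/conf.d/               # hypothetical gmond config name
# systemctl restart gmond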


Compiling VASP.6.3.0 with GPGPU Capability using Nvidia HPC-SDK on Rocky Linux 8.5

This section shows how to compile VASP with GPGPU capability using the Nvidia HPC-SDK. For more information, take a look at VASP – Install VASP.6.X.X.

VASP supports several compilers, but we will focus only on the Nvidia HPC-SDK in this blog. The NVIDIA HPC-SDK can be downloaded from the NVIDIA Developer site.

To install the Nvidia HPC SDK, take a look at the HPC SDK Documentation. Unpack the downloaded tarball:

% tar -xpzf <tarfile>.tar.gz
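After unpacking, the SDK is installed with the bundled install script (a sketch: the extracted directory name depends on the release you downloaded, and the installation directory below is an assumption chosen to match the modulefile path used later):

% cd nvhpc_<version>_Linux_x86_64_cuda_multi    # extracted directory name varies with the release
% ./install                                     # when prompted, install to e.g. /usr/local/nvidia/hpc_sdk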

If you are using Environment Modules, you may want to use the modulefiles provided with the HPC SDK:

% module use /usr/local/nvidia/hpc_sdk/modulefiles

If you then run module avail, you should see something like

------------------- /usr/local/nvidia/hpc_sdk/modulefiles ---------------
nvhpc-byo-compiler/22.5  nvhpc-nompi/22.5  nvhpc/22.5

Next, untar vasp.6.3.0 and the potpaw_PBE.54 pseudopotentials:

% tar -xvf vasp.6.3.0.tar
% tar -xvf potpaw_PBE.54.tar 

At the base of the vasp.6.3.0 directory, copy the appropriate template to makefile.include:

% cp arch/makefile.include.nvhpc_ompi_mkl_omp_acc ./makefile.include
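Before building, you may want to check which GPU compute capabilities and CUDA version the template targets, and adjust them for your cards if needed (a quick check; the exact flags depend on the VASP release):

% grep -iE "gpu|cuda" makefile.include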

Load the Nvidia HPC SDK and Intel MKL (taken here from the Intel oneAPI modulefiles, since the chosen makefile.include template links against MKL), then compile. Installing Intel oneAPI itself is not covered in this write-up.

% module use /usr/local/intel/oneapi-2022/modulefiles
% module load nvhpc/22.5
% module load mkl/latest
% make veryclean
% make DEPS=1 -j

If during the make you encounter the error

/usr/local/nvidia/hpc_sdk/Linux_x86_64/22.5/comm_libs/openmpi/openmpi-3.1.5/bin/.bin/mpif90: error while loading shared libraries: libatomic.so.1: cannot open shared object file: No such file or directory

You can install libatomic with dnf:

% dnf install libatomic -y

Then try compiling again.
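If the build completes successfully, the executables should appear under bin/ in the vasp.6.3.0 directory; something like the following is a quick sanity check (the standard, gamma-only and non-collinear versions are built by default):

% ls bin/
vasp_gam  vasp_ncl  vasp_std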

References:

  1. Installing VASP.6.X.X

Compiling Gromacs-2019.3 with Intel MKL and CUDA

Prerequisites

GCC-6.5 compilers and associated libraries
m4-1.4.18
mpfr-3.1.4
cmake-3.15.1
gmp-6.1.0
mpc-1.0.3

Intel Compilers and Prerequisites

% source /usr/local/intel/2018u3/bin/compilervars.sh intel64
% source /usr/local/intel/2018u3/impi/2018.3.222/bin64/mpivars.sh intel64
% source /usr/local/intel/2018u3/mkl/bin/mklvars.sh intel64
% source /usr/local/intel/2018u3/parallel_studio_xe_2018/bin/psxevars.sh intel64
% export MKLROOT=/usr/local/intel/2018u3/mkl
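Before configuring, a quick way to confirm that the Intel environment has been picked up (illustrative; your paths will differ):

% which mpicc icc
% echo $MKLROOT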

Create a setup file

% touch gromacs_gpgpu.sh

Put the following into gromacs_gpgpu.sh:

# run this from a build directory inside the gromacs-2019.3 source tree (hence "cmake ..")
CC=mpicc CXX=mpicxx cmake .. -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DGMX_MPI=on -DGMX_FFT_LIBRARY=mkl \
-DCMAKE_INSTALL_PREFIX=/usr/local/gromacs-2019.3_intel18_mkl_cuda10.1 -DREGRESSIONTEST_DOWNLOAD=ON \
-DCMAKE_C_FLAGS:STRING="-cc=icc -O3 -xHost -ip" \
-DCMAKE_CXX_FLAGS:STRING="-cxx=icpc -O3 -xHost -ip -I/usr/local/intel/2018u3/compilers_and_libraries_2018.3.222/linux/mpi/intel64/include/" \
-DGMX_GPU=on \
-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.1 \
-DCMAKE_BUILD_TYPE=Release \
-DCUDA_HOST_COMPILER:FILEPATH=/usr/local/intel/2018u3/compilers_and_libraries_2018.3.222/linux/bin/intel64/icpc
% chmod +x gromacs_gpgpu.sh
% ./gromacs_gpgpu.sh
% make
% make install

Testing and Verification

$ source /your/installation/prefix/here/bin/GMXRC
$ ./gmxtest.pl all -np 2
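To confirm that GPU support was actually compiled in, you can also check the version banner. With -DGMX_MPI=on the binary is typically installed as gmx_mpi; adjust the name if your build differs:

$ source /usr/local/gromacs-2019.3_intel18_mkl_cuda10.1/bin/GMXRC
$ gmx_mpi --version | grep -i gpu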