NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver for RHEL 8

If you have installed the CUDA drivers and CUDA SDK using the NVIDIA CUDA Installation Guide for Linux, look at Section 3.3.3 for RHEL 8 / Rocky 9.

If, after following the instructions, you are still facing issues, you may want to consider the following:

1- Blacklist the nouveau driver

$ vim /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
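
On RHEL 8, the blacklist typically only takes effect after the initramfs is regenerated, so you may want to rebuild it before rebooting:

# dracut --force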

2- Remove the NVIDIA driver installation

# dnf module remove --all nvidia-driver

3- Remove CUDA-Related Installation

sudo dnf remove "cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
 "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*"

4- Reboot

# shutdown -r now
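
After the reboot, you can reinstall the driver and CUDA toolkit. This is only a sketch, assuming the NVIDIA CUDA repository is already configured and the latest-dkms driver stream suits your setup:

# dnf module install nvidia-driver:latest-dkms
# dnf install cuda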

References:

  1. Forum – CentOS Stream 8: NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver

Guide to Creating Symbolic Links with Ansible

You can use the ansible.builtin.file module. In my case, I wanted to link the Environment Modules profile.csh and profile.sh into /etc/profile.d so that they load on startup, and to point /usr/local/cuda at a specific CUDA version, as in the example below. Do take a look at the Ansible documentation: ansible.builtin.file module – Manage files and file properties

- name: Check for CUDA Link
  ansible.builtin.stat:
    path: /usr/local/cuda
  register: link_available

- name: Create a symbolic link for CUDA
  ansible.builtin.file:
    src: /usr/local/cuda-12.2
    dest: /usr/local/cuda
    owner: root
    group: root
    state: link
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version == "8"
    - not link_available.stat.exists
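
For the profile.csh and profile.sh scripts mentioned above, a similar task can be used. This is a minimal sketch; the source path /usr/share/Modules/init is an assumption and depends on where Environment Modules is installed on your system:

- name: Link Environment Modules profile scripts into /etc/profile.d
  ansible.builtin.file:
    src: "/usr/share/Modules/init/{{ item }}"   # assumed Environment Modules install path
    dest: "/etc/profile.d/{{ item }}"
    owner: root
    group: root
    state: link
  loop:
    - profile.sh
    - profile.csh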

Compiling VASP.6.3.0 using Nvidia hpcx, gcc-12.3 and MKL on Rocky Linux 8.7

For information, do take a look at VASP – Install VASP.6.X.X

VASP supports several compilers, but we will be focusing on NVIDIA HPC-X only for this blog. To install NVIDIA HPC-X, do take a look at Installing and using Mellanox HPC-X Software Toolkit

To use NVIDIA HPC-X, you may want to module use its modulefiles:

export HPCX_HOME=/usr/local/hpcx-v2.15-gcc-MLNX_OFED_LINUX-5-redhat8-cuda12-gdrcopy2-nccl2.17-x86_64
module use $HPCX_HOME/modulefiles
% module avail
----------------- /usr/local/hpcx-v2.15 --------------------------------------------
hpcx  hpcx-debug  hpcx-debug-ompi  hpcx-mt  hpcx-mt-ompi  hpcx-ompi  hpcx-prof  hpcx-prof-ompi  hpcx-stack

You can untar vasp.6.3.0 and potpaw_PBE.54

% tar -xvf vasp.6.3.0.tar
% tar -xvf potpaw_PBE.54.tar 

At the base of the vasp.6.3.0 installation, copy the makefile template:

% cp arch/makefile.include.gnu_ompi_mkl_omp ./makefile.include

You will need GNU GCC-10 or later for a successful compile. I compiled with GCC-12.3. You may want to take a look at Compiling GCC 12.1.0 on Rocky Linux 8.5

If you are using Intel oneAPI MKL, you can module use it once it is installed. Installing oneAPI will not be covered in this write-up, but you can:

% module use /usr/local/intel/2023.1/modulefiles

Finally,

% module load mkl/latest
% module load gnu/gcc-12.3
% module load hpcx
% make veryclean
% make DEPS=1 -j
.....
.....
make[2]: Leaving directory '/usr/local/vasp/vasp.6.3.0/build/std'
make[1]: Leaving directory '/usr/local/vasp/vasp.6.3.0/build/std'

In the bin directory, you should see vasp_gam, vasp_ncl and vasp_std.
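
To make the binaries available on the command line, you can prepend the bin directory to your PATH (path taken from the build output above):

% export PATH=/usr/local/vasp/vasp.6.3.0/bin:$PATH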

Using the Ansible Expect Module to execute a command and respond to prompts

Ansible Documentation: ansible.builtin.expect module – Executes a command and responds to prompts

The Ansible Expect module is very useful to listen for certain strings in stdout and react accordingly. This is particularly useful if you have to accept a license agreement or enter some important information. Here is my sample:

- name: Install RPM package from local system
  yum:
    name: /tmp/my-software.rpm
    state: present
    disable_gpg_check: true
  when: ansible_os_family == "RedHat"

- name: Check for the MySoftware installation directory
  ansible.builtin.stat:
    path: /usr/local/mysoftware
  register: directory_check

- name: Setup Licensing Server's Connection if directory does not exist
  ansible.builtin.expect:
    command: /usr/local/mysoftware/install.sh
    responses:
      (?i)Do you already have a license server on your network\? \[y/N\]: "y"
      (?i)Enter the name \(or IP address\) of your license server: "xx.xx.xx.xx"
      (?i)Install/update the MySoftware web service\? \[Y/n\]: "n"
  when: not directory_check.stat.exists
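
Note that ansible.builtin.expect requires the Python pexpect library on the target host. A minimal sketch to put it in place beforehand (using pip here; your distribution may also package it as python3-pexpect):

- name: Ensure pexpect is available for the expect module
  ansible.builtin.pip:
    name: pexpect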

Compiling OpenMPI-4.1.5 for ROCEv2 with GNU-8.5

https://docs.open-mpi.org/en/v5.0.x/release-notes/networks.html

Prerequisites 1

First things first, you may want to check whether you are using RoCE. Do take a look at Installing RoCE using Mellanox (Nvidia) OFED package

Prerequisites 2

Do check whether you have UCX. You can do a dnf install:

# dnf install ucx ucx-devel

Alternatively, you can do a manual install. For information on how to install, do take a look at http://openucx.org/wp-content/uploads/UCX_install_guide.pdf

$ wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
$ tar xzf ucx-1.4.0.tar.gz
$ cd ucx-1.4.0
$ ./contrib/configure-release --prefix=/usr/local/ucx-1.4.0
$ make -j8
$ make install
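
Either way, you can verify the installation and see which transports UCX detects with the ucx_info utility that ships with UCX:

$ ucx_info -v
$ ucx_info -d | grep Transport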

Prerequisites 3

Make sure you have installed the GNU C and C++ compilers. This can be done easily using:

# dnf install gcc-c++ gcc
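
You can confirm the compiler is in place with:

$ gcc --version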

Step 1: Download the OpenMPI package

You can go to the Open MPI site to download the latest package at https://www.open-mpi.org/software/ompi/v4.1/. The latest one at the point of writing is OpenMPI-4.1.5.

Step 2: Compile the Package

$ ./configure --prefix=/usr/local/openmpi-4.1.5 --enable-mpi-cxx --with-devel-headers --with-ucx --with-verbs --with-slurm=no
$ make && make install
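
Once installed, you can confirm that Open MPI was built with UCX support by checking ompi_info; the pml ucx component should be listed:

$ /usr/local/openmpi-4.1.5/bin/ompi_info | grep ucx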

Step 3: To run mpirun using RoCE, do the following.

You may want to see Network Support Information on OpenMPI

$ mpirun -np 12 --hostfile path/to/hostfile --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ........

References:

  1. Setting up a RoCE cluster
  2. OpenMPI – Network Support
  3. How do I run Open MPI over RoCE? (UCX PML)

Installing RoCE using Mellanox (Nvidia) OFED package

Prerequisites:

Do read Basic Understanding RoCE and Infiniband

Step 1: Install Mellanox Package

First and foremost, you have to install the Mellanox package, which you can download at https://developer.nvidia.com/networking/ethernet-software. You may want to consider installing using the traditional method or the Ansible method (Installing Mellanox OFED (mlnx_ofed) packages using Ansible)

Step 2: Load the Drivers

Activate the two kernel modules that are needed for RDMA and RoCE exchanges by using the command:

# modprobe rdma_cm ib_umad
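
To have these modules load automatically at boot, you can also list them in a modules-load.d file (the file name roce.conf below is just an example):

# echo -e "rdma_cm\nib_umad" > /etc/modules-load.d/roce.conf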

Step 3: Verify the drivers are loaded

# ibv_devinfo
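
Each Mellanox HCA (for example mlx5_0) should be listed; to focus on the link state you can filter the output, which should report PORT_ACTIVE on a working port:

# ibv_devinfo | grep -E "hca_id|state"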

Step 4: Set the RoCE to version 2

Set the version of the RoCE protocol to v2 by issuing the command below.

  • -d is the device
  • -p is the port
  • -m is the RoCE version

[root@node1]# cma_roce_mode -d mlx5_0 -p 1 -m 2
RoCE v2

Step 5: Check which RoCE devices are enabled on the Ethernet

[root@node-1]# ibdev2netdev
mlx5_0 port 1 ==> ens1f0 (Up)
mlx5_1 port 1 ==> ens1f1 (Down)

References:

  1. Setting up a RoCE cluster

Basic Understanding RoCE and Infiniband

Prerequisites:

  1. RoCE requires compliant Ethernet adapters. Currently, I am using Mellanox ConnectX-6 cards.
  2. RoCE requires a compliant switch. I used a Mellanox 100G switch.

The difference between traditional Ethernet communication and RoCE is explained very clearly in the diagram from Huawei's Basic Knowledge and Differences of RoCE, IB, and TCP Networks

Some Key Pointers on the difference between TCP/IP and RDMA

  1. Traditional TCP/IP network communication uses the kernel to send messages, which incurs high data movement and data replication overhead.
  2. RDMA can bypass the kernel and access memory directly, which allows low-latency network communication.

The 3 types of RDMA network technologies (InfiniBand, RoCE, and iWARP) are neatly presented in Basic Knowledge and Differences of RoCE, IB, and TCP Networks

References:

  1. Basic Knowledge and Differences of RoCE, IB, and TCP Networks