Installing RoCE using Mellanox (Nvidia) OFED package

Prerequisites:

Do read the article Basic Understanding RoCE and Infiniband first.

Step 1: Install Mellanox Package

First and foremost, you have to install the Mellanox OFED package, which you can download from https://developer.nvidia.com/networking/ethernet-software. You may want to consider installing it using the traditional method or the Ansible method (see Installing Mellanox OFED (mlnx_ofed) packages using Ansible).

Step 2: Load the Drivers

Load the two kernel modules that are needed for RDMA and RoCE exchanges by using the command

# modprobe rdma_cm ib_umad
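
If you want these two modules to load automatically at boot, here is a minimal sketch assuming a systemd-based distribution that reads /etc/modules-load.d/ (the file name roce.conf is just an example):

# echo rdma_cm > /etc/modules-load.d/roce.conf
# echo ib_umad >> /etc/modules-load.d/roce.conf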

Step 3: Verify the drivers are loaded

# ibv_devinfo
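
The command should list your Mellanox HCAs (for example mlx5_0) together with their firmware version and port state. You can also confirm that the two modules from Step 2 are loaded:

# lsmod | grep -E 'rdma_cm|ib_umad'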

Step 4: Set the RoCE to version 2

Set the version of the RoCE protocol to v2 by issuing the command below.

  • -d is the device
  • -p is the port
  • -m is the RoCE version
[root@node1]# cma_roce_mode -d mlx5_0 -p 1 -m 2
RoCE v2
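
If you only want to query the current setting, you can also look at the per-GID RoCE type exposed in sysfs. This is a sketch: mlx5_0, port 1 and GID index 0 follow the example above, and the exact path assumes a reasonably recent MLNX_OFED/kernel where each GID index reports either RoCE v1 or RoCE v2:

# cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0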

Step 5: Check which RoCE devices are enabled on the Ethernet

[root@node-1]# ibdev2netdev
mlx5_0 port 1 ==> ens1f0 (Up)
mlx5_1 port 1 ==> ens1f1 (Down)
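
To cross-check from the Ethernet side, you can inspect the interface that ibdev2netdev maps to the device (ens1f0 is taken from the output above; substitute your own interface name):

# ip link show ens1f0
# ethtool ens1f0 | grep -i speed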

References:

  1. Setting up a RoCE cluster

Encountering shm_open permission denied issues with hpcx

If you are using Nvidia HPC-X and encounter an error like the one below during your MPI run:

shm_open(file_name=/ucx_shm_posix_77de2cf3 flags=0xc2) failed: Permission denied

The error message indicates that the process does not have permission to use the shared memory. In this case the permission of /dev/shm was found to be 755 rather than 777, causing the error. The issue is resolved once the permission is changed to 777. To change it and verify the change:

% chmod 777 /dev/shm 
% ls -ld /dev/shm
drwxrwxrwx 2 root root 40 Jul  6 15:18 /dev/shm
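
Note that /dev/shm is a tmpfs mount, so the chmod may not persist across reboots on some systems. A sketch of making the mode explicit, assuming the mount is managed through /etc/fstab (1777 is the usual default and also satisfies HPC-X):

tmpfs  /dev/shm  tmpfs  defaults,mode=1777  0 0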

EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches

Nvidia Corporation has announced the EOL Notice #LCR-000906 – MELLANOX

PCN INFORMATION:
PCN Number: LCR-000906 – MELLANOX
PCN Description: EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches
Publish Date: May 8, 2022
Type: FYI

Performance Required for Deep Learning

There is a question I wanted to explore about deep learning: which system, network, and protocol features are essential to speed up training and/or inferencing? It may not be necessary to apply the same level of requirements to training as to inferencing, and vice versa. I received the information below during an Nvidia presentation.

Training:

  1. Scalability requires ultra-fast networking
  2. Same hardware needs as HPC
  3. Extreme network bandwidth
  4. RDMA
  5. SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
  6. GPUDirect (https://developer.nvidia.com/gpudirect)
  7. Fast Access Storage

Inferencing:

  1. Highly Transactional
  2. Ultra-low Latency
  3. Instant Network Response
  4. RDMA
  5. PeerDirect, GPUDirect


Best Practices to Secure the Edge Cloud Environment

In this webinar you will learn:

  • Challenges in securing edge data centers
  • How to secure the edge cloud without compromising on application performance
  • The role of NVIDIA Mellanox DPU in securing cloud to edge

Date: Aug 4, 2020
Time: 2:00pm SGT | 11:30am IST | 4:00pm AEST

To register: https://www.mellanox.com/webinar/best-practices-secure-edge-cloud-environment


Installing and using Mellanox HPC-X Software Toolkit

Overview

Taken from Mellanox HPC-X Software Toolkit User Manual 2.3

Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communication libraries. HPC-X includes various acceleration packages to improve both the performance and scalability of applications running on top of these libraries, including UCX (Unified Communication X) and MXM (Mellanox Messaging), which accelerate the underlying send/receive (or put/get) messages. It also includes FCA (Fabric Collectives Accelerations), which accelerates the underlying collective operations used by the MPI/PGAS languages.

Download

https://www.mellanox.com/products/hpc-x-toolkit

Installation

% tar -xvf hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.6-x86_64.tbz
% cd hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.6-x86_64
% export HPCX_HOME=/usr/local/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.6-x86_64
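
Note that the tarball is extracted into the current directory while HPCX_HOME points at /usr/local, so move the extracted tree there first if you intend to use that path (a sketch, assuming you have root privileges and want the toolkit under /usr/local):

% sudo mv hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.6-x86_64 /usr/local/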

Loading HPC-X Environment from BASH

HPC-X includes Open MPI v4.0.x. Each Open MPI version has its own module file, which can be used to load the desired version.

% source $HPCX_HOME/hpcx-init.sh
% hpcx_load
% env | grep HPCX
% mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
% oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% hpcx_unload

Loading HPC-X Environment from Modules

You can use the already built module files in hpcx.

% module use $HPCX_HOME/modulefiles
% module load hpcx
% mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
% oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% module unload hpcx

Building HPC-X with the Intel Compiler Suite

Do take a look at the Mellanox HPC-X® ScalableHPC Software Toolkit

References:

  1. Mellanox HPC-X Software Toolkit User Manual 2.3
  2. Mellanox HPC-X® ScalableHPC Software Toolkit

Fabric Debug Initiation using ibdiagnet (Part 1)

I learned some of these steps from the Mellanox Academy Online Training.

Step 1: Clear all counters and begin the test execution

ibdiagnet -pc

Wait for a while; it usually takes 30 to 60 minutes.

Step 2: Check for errors that exceed the allowed threshold

ibdiagnet -ls 25 -lw 4x -P all=1 --pm_pause_time 30
  • Specify the link speed
    -ls <2.5|5|10|14|25|50>
  • Specify the link width
    -lw <1x|4x|8x|12x>
  • Check the information provided by all counters and display every counter that crosses a threshold of 1
    -P all=1
  • The time between the two samples is set by the --pm_pause_time option
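
Once the run finishes, ibdiagnet writes its reports to an output directory, typically /var/tmp/ibdiagnet2 for the ibdiagnet2 shipped with MLNX_OFED (the path and log file name below are assumptions; the tool prints the actual location in its summary). A quick way to scan the main log for problems:

% grep -i -E "error|warn" /var/tmp/ibdiagnet2/ibdiagnet2.log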

Webinar – Build the Most Powerful Data Center with GPU Computing Technology and High-speed Interconnect

Build the Most Powerful Data Center with GPU Computing Technology and High-speed Interconnect

Date: Thursday, June 11, 2020
Time: 11:00am-12:30pm Singapore Time

Register here 

Please join NVIDIA as we discuss how to design a well-balanced system that maximizes the performance and scalability of various workloads using NVIDIA GPUs and interconnect.

Speakers will provide an overview of the state-of-the-art NVIDIA GPU-accelerated compute architecture and In-Network Computing fabric, and how they come together with one goal: to deliver a solution that democratizes supercomputing power, making it readily accessible, installable, and manageable in a modern business setting. To learn more about this webinar, click here.