Performance Required for Deep Learning

I wanted to find out which system, network, and protocol features are essential for speeding up deep learning training and/or inferencing. Training and inferencing do not necessarily have the same level of requirements. The following points come from an NVIDIA presentation.


Training:

  1. Scalability requires ultra-fast networking
  2. Same hardware needs as HPC
  3. Extreme network bandwidth
  4. RDMA
  5. SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
  6. GPUDirect
  7. Fast Access Storage

Inferencing:

  1. Highly Transactional
  2. Ultra-low Latency
  3. Instant Network Response
  4. RDMA
  5. PeerDirect, GPUDirect
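
To get a feel for why network bandwidth matters so much for training, here is a rough back-of-the-envelope sketch. It assumes gradients are synchronized with a ring all-reduce, in which each node sends about 2(N-1)/N times the gradient size per step, and it ignores latency and any overlap with compute. The function name, the 1000 MB gradient size, the 8 nodes, and the link speeds are illustrative assumptions of mine, not figures from the presentation.

```shell
# Rough estimate of the per-step gradient exchange time for a ring all-reduce.
# Assumptions (not from the presentation): transfer is bandwidth-bound,
# no latency term, no overlap with compute.
# Usage: allreduce_ms <gradient_size_MB> <nodes> <link_Gbit_per_s>
allreduce_ms() {
    awk -v mb="$1" -v n="$2" -v gbps="$3" 'BEGIN {
        bytes = mb * 1000000 * 2 * (n - 1) / n   # bytes each node sends
        secs  = bytes * 8 / (gbps * 1000000000)  # bits / (bits per second)
        printf "%.1f\n", secs * 1000             # milliseconds
    }'
}

allreduce_ms 1000 8 10    # 10 Gb/s link:  prints 1400.0
allreduce_ms 1000 8 100   # 100 Gb/s link: prints 140.0
```

Under these assumptions the exchange drops from about 1.4 seconds to about 0.14 seconds per step when moving from a 10 Gb/s to a 100 Gb/s link, which is time the GPUs would otherwise spend waiting.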



Cumulus in the Cloud Demo

Cumulus in the Cloud offers a free, personal, virtual data center network that provides a low-effort way to see Cumulus Networks technology in action and to learn about the latest open innovations that can help you improve network designs and operations.

Your virtual data center consists of two racks with two dual-homed servers connected with a leaf-spine network. The infrastructure can be personalized with production-ready automation or left unconfigured as a “blank slate”.



Best Practices to Secure the Edge Cloud Environment

In this webinar you will learn:

  • Challenges in securing edge data centers
  • How to secure the edge cloud without compromising on application performance
  • The role of NVIDIA Mellanox DPU in securing cloud to edge

Date: Aug 4, 2020
Time: 2:00pm SGT | 11:30am IST | 4:00pm AEST



Installing and using Mellanox HPC-X Software Toolkit


Taken from Mellanox HPC-X Software Toolkit User Manual 2.3

Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communication libraries. HPC-X includes various acceleration packages to improve both the performance and scalability of applications running on top of these libraries, including UCX (Unified Communication X) and MXM (Mellanox Messaging), which accelerate the underlying send/receive (or put/get) messages. It also includes FCA (Fabric Collectives Accelerations), which accelerates the underlying collective operations used by the MPI/PGAS languages.



% tar -xvf hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-
% cd hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-
% export HPCX_HOME=/usr/local/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-

Loading HPC-X Environment from BASH

HPC-X includes Open MPI v4.0.x. Each Open MPI version has its own module file, which can be used to load the desired version.

% source $HPCX_HOME/
% hpcx_load
% env | grep HPCX
% mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
% oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% hpcx_unload

Loading HPC-X Environment from Modules

You can use the prebuilt module files that ship with HPC-X.

% module use $HPCX_HOME/modulefiles
% module load hpcx
% mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
% oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% module unload hpcx
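
Whichever way the environment is loaded, it can help to confirm that it actually took effect before compiling the examples. The following is a minimal sketch, assuming only that a loaded HPC-X environment puts mpicc, mpirun, oshcc and oshrun on the PATH and sets HPCX_MPI_TESTS_DIR, as used in the commands above. The function name check_hpcx_env is hypothetical and is not part of HPC-X.

```shell
# check_hpcx_env: hypothetical helper, not part of HPC-X.
# Prints what is missing and returns non-zero if the environment looks unloaded.
check_hpcx_env() {
    missing=0
    for cmd in mpicc mpirun oshcc oshrun; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd"
            missing=1
        fi
    done
    if [ -z "${HPCX_MPI_TESTS_DIR:-}" ]; then
        echo "HPCX_MPI_TESTS_DIR is not set"
        missing=1
    fi
    return "$missing"
}
```

A typical use would be to run `check_hpcx_env && mpicc ...` right after `hpcx_load` or `module load hpcx`, so that a forgotten load step fails early with a readable message.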

Building HPC-X with the Intel Compiler Suite

For this, take a look at the Mellanox HPC-X® ScalableHPC Software Toolkit.


References:

  1. Mellanox HPC-X Software Toolkit User Manual 2.3
  2. Mellanox HPC-X® ScalableHPC Software Toolkit

Fabric Debug Initiation using ibdiagnet (Part 1)

I learned some of these steps from the Mellanox Academy online training.

Step 1: Clear all counters and begin the test execution

ibdiagnet -pc

Wait for a while; this usually takes 30 to 60 minutes.

Step 2: Check for errors that exceed the allowed threshold

ibdiagnet -ls 25 -lw 4x -P all=1 --pm_pause_time 30
  • Specify the link speed
    -ls <2.5|5|10|14|25|50>
  • Specify the link width
    -lw <1x|4x|8x|12x>
  • Check all counters and report each one that crosses a threshold of 1
    -P all=1
  • The time between the two counter samples is set by the --pm_pause_time option (30 seconds in the command above)
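
Since the link speed and width only accept the values listed above, a small wrapper can catch typos before the check is launched. This is a sketch only: ibdiag_check_cmd is a hypothetical helper name, it just validates its arguments against the values listed above and prints the command line, and it does not run ibdiagnet itself.

```shell
# ibdiag_check_cmd: hypothetical helper that validates the link speed and width
# against the values listed above, then prints the ibdiagnet command line.
# It does not run ibdiagnet itself.
# Usage: ibdiag_check_cmd <link_speed> <link_width> <pause_seconds>
ibdiag_check_cmd() {
    case "$1" in
        2.5|5|10|14|25|50) ;;
        *) echo "unsupported link speed: $1" >&2; return 1 ;;
    esac
    case "$2" in
        1x|4x|8x|12x) ;;
        *) echo "unsupported link width: $2" >&2; return 1 ;;
    esac
    echo "ibdiagnet -ls $1 -lw $2 -P all=1 --pm_pause_time $3"
}

ibdiag_check_cmd 25 4x 30
# prints: ibdiagnet -ls 25 -lw 4x -P all=1 --pm_pause_time 30
```

Printing the command instead of executing it keeps the sketch safe to try on a machine without an InfiniBand fabric; the printed line can then be reviewed and run by hand.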

Webinar – Build the Most Powerful Data Center with GPU Computing Technology and High-speed Interconnect


Date: Thursday, June 11, 2020
Time: 11:00am-12:30pm Singapore Time


Please join NVIDIA as we discuss how to design a well-balanced system that maximizes the performance and scalability of various workloads using NVIDIA GPUs and interconnect.

Speakers will provide an overview of the state-of-the-art NVIDIA GPU accelerated compute architecture and In-Network computing fabric and how they come together with one goal: to deliver a solution that democratizes supercomputing power, making it readily accessible, installable, and manageable in a modern business setting.