Encountering shm_open permission denied issues with HPC-X

If you are using Nvidia HPC-X and encounter an error like the one below during your MPI run:

shm_open(file_name=/ucx_shm_posix_77de2cf3 flags=0xc2) failed: Permission denied

The error message indicates that the process does not have permission to use the POSIX shared memory segment. In this case, the permissions on /dev/shm were found to be 755 instead of 777, which caused the error. The issue is resolved once the permissions are changed to 777. To make and verify the change:

% chmod 777 /dev/shm 
% ls -ld /dev/shm
drwxrwxrwx 2 root root 40 Jul  6 15:18 /dev/shm
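
Note that the chmod above does not persist across a reboot. A minimal sketch of a persistent fix, assuming /dev/shm is a tmpfs mount managed through /etc/fstab (typical on most Linux distributions, but check your own setup), is to set the mode in the fstab entry and remount; mode=1777 keeps the sticky bit, which is the conventional default for /dev/shm:

# /etc/fstab entry for /dev/shm (illustrative)
tmpfs   /dev/shm   tmpfs   defaults,mode=1777   0 0

% mount -o remount /dev/shm
% ls -ld /dev/shm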

EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches

Nvidia Corporation has announced EOL Notice #LCR-000906 – MELLANOX:

PCN INFORMATION:
PCN Number: LCR-000906 – MELLANOX
PCN Description: EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches
Publish Date: May 8, 2022
Type: FYI

A relook at InfiniBand and Ethernet Trends on Top500

I have put up an article on the Top500 interconnect trends from the Nvidia perspective. There is another article, put up by The Next Platform, that takes a closer look at the InfiniBand and Ethernet trends.

Taken from The Next Platform, “The Eternal Battle Between InfiniBand and Ethernet”:

The penetration of Ethernet rises as the list fans out, as you might expect, with many academic and industry HPC systems not being able to afford InfiniBand or not willing to switch away from Ethernet. And as those service providers, cloud builders, and hyperscalers run Linpack on small portions of their clusters for whatever political or business reasons they have. Relatively slow Ethernet is popular in the lower half of the Top500 list, and while InfiniBand gets down there, its penetration drops from 70 percent in the Top10 to 34 percent in the complete Top500.

Nvidia’s InfiniBand has 34 percent share of Top500 interconnects, with 170 systems, but what has not been obvious is the rise of Mellanox Spectrum and Spectrum-2 Ethernet switches on the Top500, which accounted for 148 additional systems. That gives Nvidia a 63.6 percent share of all interconnects on the Top500 rankings. That is the kind of market share that Cisco Systems used to enjoy for two decades in the enterprise datacenter, and that is quite an accomplishment.


References:

The Eternal Battle Between InfiniBand and Ethernet (The Next Platform)

UDP Tuning to maximise performance

There is an interesting article on how your UDP traffic can be tuned to maximise performance with a few tweaks. The article is taken from UDP Tuning.

The most important factors, as mentioned in the article, are:

  • Use jumbo frames: performance will be 4-5 times better using 9K MTUs.
  • Packet size: best performance is MTU size minus packet header size. For example, for a 9000-byte MTU, use 8972 for IPv4 and 8952 for IPv6.
  • Socket buffer size: for UDP, buffer size is not related to RTT the way it is for TCP, but the defaults are still not large enough. Setting the socket buffer to 4 MB seems to help a lot in most cases.
  • Core selection: UDP at 10G is typically CPU-limited, so it is important to pick the right core. This is particularly true on Sandy/Ivy Bridge motherboards (see the example commands after this list).
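
As a quick sketch, these settings can be applied on a typical Linux host with commands like the following. The interface name eth0, the 4 MB buffer ceilings, core 2, and the <server> placeholder are illustrative assumptions, not values from the article:

% ip link set dev eth0 mtu 9000                       # jumbo frames; the switch ports must also allow a 9K MTU
% sysctl -w net.core.rmem_max=4194304                 # raise the ceiling for receive socket buffers to 4 MB
% sysctl -w net.core.wmem_max=4194304                 # raise the ceiling for send socket buffers to 4 MB
% taskset -c 2 iperf3 -s                              # pin the UDP receiver to a chosen core
% taskset -c 2 iperf3 -c <server> -u -b 10G -l 8972   # 8972-byte datagrams fill a 9000-byte IPv4 MTU

Note that rmem_max/wmem_max only raise the allowed maximum; the application still has to request the larger buffer via setsockopt(SO_RCVBUF/SO_SNDBUF), or the system-wide defaults can be raised with net.core.rmem_default and net.core.wmem_default.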

Do take a look at the full article, UDP Tuning.

Performance Required for Deep Learning

There is a question I have wanted to answer about deep learning: which system, network, and protocol features are essential to speed up training and/or inferencing? It may not be necessary to apply the same level of requirements to training as to inferencing, and vice versa. I received the information below during an Nvidia presentation.

Training:

  1. Scalability requires ultra-fast networking
  2. Same hardware needs as HPC
  3. Extreme network bandwidth
  4. RDMA
  5. SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
  6. GPUDirect (https://developer.nvidia.com/gpudirect)
  7. Fast Access Storage

Inferencing:

  1. Highly Transactional
  2. Ultra-low Latency
  3. Instant Network Response
  4. RDMA
  5. PeerDirect, GPUDirect
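
As a rough sketch, the following commands can be used to check whether a node actually exposes the RDMA and GPUDirect capabilities listed above. This assumes rdma-core, UCX, and the Nvidia driver are installed; the GPUDirect RDMA module is named nvidia_peermem on recent drivers and nv_peer_mem on older ones:

% ibv_devinfo | grep -E 'hca_id|link_layer'        # list RDMA-capable adapters and their link layer (InfiniBand or Ethernet)
% ucx_info -d | grep -i cuda                       # was UCX (used by HPC-X) built with CUDA/GPUDirect support?
% lsmod | grep -E 'nvidia_peermem|nv_peer_mem'     # is the GPUDirect RDMA kernel module loaded?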

Cumulus in the Cloud Demo

Cumulus in the Cloud offers a free, personal, virtual data center network that provides a low-effort way to see Cumulus Networks technology in action and to learn about the latest open innovations that can help you improve network designs and operations.

Your virtual data center consists of two racks with two dual-homed servers connected with a leaf-spine network. The infrastructure can be personalized with production-ready automation or left unconfigured as a “blank slate”.

For more information, see https://cumulusnetworks.com/products/cumulus-in-the-cloud/


Best Practices to Secure the Edge Cloud Environment

In this webinar you will learn:

  • Challenges in securing edge data centers
  • How to secure the edge cloud without compromising on application performance
  • The role of NVIDIA Mellanox DPU in securing cloud to edge

Date: Aug 4, 2020
Time: 2:00pm SGT | 11:30am IST | 4:00pm AEST

To register: https://www.mellanox.com/webinar/best-practices-secure-edge-cloud-environment