Installing Nvidia DOCA OFED Documentation from Nvidia for Rocky Linux

Taken from Installing Nvidia DOCA OFED. Do read the documentation for more information. Other relevant documentation includes:

Quick Reference

Installation Profiles

DOCA-Host Profile    Description
doca-ofed            Allows you to install the same drivers and tools of MLNX_OFED using the DOCA-Host package, but without other DOCA functionality.
doca-network         Intended for users who want to use only the networking functionality of the DOCA-Host package.
doca-all             Intended for users who want to use the full extent of DOCA drivers and libraries, the full DOCA-Host installation.

# Remove the installed DOCA OFED software from the host.
for f in $(rpm -qa | grep -i doca ) ; do sudo yum -y remove $f; done

# Remove the installed MLNX_OFED software.
sudo /usr/sbin/ofed_uninstall.sh --force

sudo dnf autoremove
sudo dnf clean all -y
sudo dnf makecache -y

Download and Install Nvidia RPM GPG Key

sudo wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox-SHA256
sudo rpm --import RPM-GPG-KEY-Mellanox-SHA256

DOCA-OFED

Create the repository file in /etc/yum.repos.d/

touch /etc/yum.repos.d/doca.repo

Inside /etc/yum.repos.d/doca.repo, add the following

[doca]
name=DOCA Online Repo
baseurl=https://linux.mellanox.com/public/repo/doca/3.2.1/rhel8/x86_64/
enabled=1
gpgcheck=0

Save and Exit
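
As a minimal sketch, the same repo file can also be written non-interactively with a here-document. The function name write_doca_repo and its arguments are illustrative and not part of DOCA; the URL layout simply mirrors the baseurl shown above, so do check that the version and distro directory you pass actually exist on the repository.

```shell
# write_doca_repo VERSION DISTRO OUTFILE
# Hypothetical helper: writes a doca.repo file whose baseurl follows the
# pattern used above (for example version 3.2.1 and distro rhel8).
write_doca_repo() {
  cat > "$3" <<EOF
[doca]
name=DOCA Online Repo
baseurl=https://linux.mellanox.com/public/repo/doca/$1/$2/x86_64/
enabled=1
gpgcheck=0
EOF
}

# Typical use: write to a scratch file first, then install it as root.
#   write_doca_repo 3.2.1 rhel8 /tmp/doca.repo
#   sudo install -m 644 /tmp/doca.repo /etc/yum.repos.d/doca.repo
```

Writing to a scratch file first makes it easy to review the content before it lands in /etc/yum.repos.d/.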

Install DOCA-OFED

dnf install -y doca-ofed

Validating that OFED and RoCE v2 are working

One of the quickest checks is to run ibstat

CA 'mlx5_0'
	CA type: MT4127
	Number of ports: 1
	Firmware version: 26.43.2026
	Hardware version: 0
	Node GUID: 0x5000e6030073b514
	System image GUID: 0x5000e6030073b514
	Port 1:
		State: Down
		Physical state: Disabled
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x.....
		Port GUID: 0x......
		Link layer: Ethernet
CA 'mlx5_1'
	CA type: MT4127
	Number of ports: 1
	Firmware version: 26.43.2026
	Hardware version: 0
	Node GUID: 0x.....
	System image GUID: 0x.....
	Port 1:
		State: Active
		Physical state: LinkUp
		Rate: 25
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x.......
		Port GUID: 0x.....
		Link layer: Ethernet
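
When a host has several adapters, the ibstat output gets long. The sketch below condenses it to one line per port. The helper name ibstat_summary is made up for illustration, and the parsing assumes the layout shown above (a "CA '<name>'" header followed by "State:" and "Link layer:" lines); other ibstat versions may print things differently.

```shell
# ibstat_summary: read `ibstat` output on stdin and print one line per port:
#   <device> <state> <link layer>
ibstat_summary() {
  awk -v q="'" '
    # CA header line, e.g. CA followed by the quoted device name
    $1 == "CA" && $2 != "type:"    { ca = $2; gsub(q, "", ca) }
    # Logical port state (the "Physical state:" line starts with "Physical")
    $1 == "State:"                 { state = $2 }
    # "Link layer:" is the last field printed for a port
    $1 == "Link" && $2 == "layer:" { print ca, state, $3 }
  '
}

# Typical use on a host with the drivers installed:
#   ibstat | ibstat_summary
```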

For further checks, see Installing RoCE using Mellanox (Nvidia) OFED package

Checking Assigned Logical Name to Hardware Brand

Method 1: Using ethtool and lspci

[root@hpc-node1 ~]# ethtool -i ens3f1np1
driver: mlx5_core
version: 25.10-1.7.1
firmware-version: 26.43.2026 (MT_0000000575)
expansion-rom-version: 
bus-info: 0000:5d:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[root@hpc-node1 ~]# lspci -s 0000:5d:00.1
0000:5d:00.1 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
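
The two commands in Method 1 can be chained so that the PCI address does not have to be copied by hand. This is only a sketch: the helper name bus_info is invented here, and it relies on the "bus-info:" line appearing in the ethtool -i output as shown above.

```shell
# bus_info: read `ethtool -i <iface>` output on stdin and print the PCI address.
bus_info() {
  awk '$1 == "bus-info:" { print $2 }'
}

# Typical use (interface name taken from the example above):
#   lspci -s "$(ethtool -i ens3f1np1 | bus_info)"
```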

Method 2: Using lshw

[root@hpc-wfly-i022 ~]# lshw -C network
.....
.....
*-network:1
       description: Ethernet interface
       product: MT2894 Family [ConnectX-6 Lx]
       vendor: Mellanox Technologies
       physical id: 0.1
       bus info: pci@0000:5d:00.1
       logical name: ens3f1
       version: 00
       serial: 50:00:e6:73:b5:15
       capacity: 10Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical fibre 1000bt-fd 10000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=4.18.0-553.54.1.el8_10.x86_64 firmware=26.43.2026 (MT_0000000575) latency=0 link=no multicast=yes port=fibre
       resources: iomemory:1f3f0-1f3ef irq:17 memory:1f3ffa000000-1f3ffbffffff memory:b5f00000-b5ffffff
.....
.....

Using Mellanox ConnectX VPI Ports to Ethernet or InfiniBand

The Mellanox ConnectX5 VPI adapter supports both Ethernet and InfiniBand port modes, which must be configured.

Check Status

# mst status -v
MST modules:
------------
    MST PCI module is not loaded
    MST PCI configuration module is not loaded
PCI devices:
------------
DEVICE_TYPE             MST                           PCI       RDMA            NET                                     NUMA  
ConnectX4(rev:0)        /dev/mst/mt4115_pciconf3      8b:00.0   mlx5_3                                                  1     

ConnectX4(rev:0)        /dev/mst/mt4115_pciconf2      84:00.0   mlx5_2                                                  1     

ConnectX4(rev:0)        /dev/mst/mt4115_pciconf1      0c:00.0   mlx5_1                                                  0     

ConnectX4(rev:0)        /dev/mst/mt4115_pciconf0      05:00.0   mlx5_0                                                  0     

Start MST

# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Create devices
Unloading MST PCI module (unused) - Success

Change the port type to Ethernet (LINK_TYPE = 2)

# mlxconfig -d /dev/mst/mt4115_pciconf2 set LINK_TYPE_P1=2
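
mlxconfig changes generally take effect only after a reboot or a firmware reset, so it can help to read the stored value back first with the query subcommand. A minimal sketch follows; the helper name link_types is invented, and the parsing assumes query lines of the form "LINK_TYPE_P1  ETH(2)", which may differ between versions of the firmware tools.

```shell
# link_types: read `mlxconfig -d <device> query` output on stdin and print
# the configured link type of each port as NAME=VALUE.
# Assumption: each setting is printed as "<NAME> <VALUE>" on its own line.
link_types() {
  awk '$1 ~ /^LINK_TYPE_P[0-9]+$/ { print $1 "=" $2 }'
}

# Typical use (device path taken from the example above):
#   mlxconfig -d /dev/mst/mt4115_pciconf2 query | link_types
```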

Check that the port type was changed to Ethernet

# ibdev2netdev
mlx5_0 port 1 ==> ens1np0 (Down)
mlx5_1 port 1 ==> enp12s0np0 (Down)
mlx5_2 port 1 ==> enp132s0np0 (Up)
mlx5_3 port 1 ==> enp139s0np0 (Down)

Using NMCLI to manage Network on Rocky Linux 8

Point 1: View all the saved connections

# nmcli connection show
ens1f0     XXXX-XXXX-XXXX-XXXX-XXXX  ethernet  ens1f0
ens1f1     YYYY-YYYY-YYYY-YYYY-YYYY  ethernet  ens1f1
ens10f0    ZZZZ-ZZZZ-ZZZZ-ZZZZ-ZZZZ  ethernet  --
ens10f1    AAAA-AAAA-AAAA-AAAA-AAAA  ethernet  --

Point 2a: Stop Network

You can use the command “nmcli connection down ssid/uuid”. For example

# nmcli connection down XXXX-XXXX-XXXX-XXXX-XXXX
Connection 'eno0' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/3)

Point 2b: Start Network

You can use the command “nmcli connection up ssid/uuid”. For example

# nmcli connection up XXXX-XXXX-XXXX-XXXX-XXXX
Connection 'eno0' successfully activated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/3)

Point 3: Device Connection

To check the Device status

# nmcli dev status
ens1f0  ethernet  connected     ens1f0
eno1f1  ethernet  connected     ens1f1
eno10f0  ethernet  disconnected  --
eno10f1  ethernet  disconnected  --
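
For scripting, nmcli has a terse mode (-t) with selectable fields (-f), which is easier to parse than the table above. The sketch below lists only the disconnected devices; the helper name disconnected_devices is invented for this example.

```shell
# disconnected_devices: read the colon-separated output of
#   nmcli -t -f DEVICE,STATE device status
# on stdin and print the devices whose state is "disconnected".
disconnected_devices() {
  awk -F: '$2 == "disconnected" { print $1 }'
}

# Typical use:
#   nmcli -t -f DEVICE,STATE device status | disconnected_devices
```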

Point 4: List all Devices

# nmcli device show
GENERAL.DEVICE:                         ens1f0
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         XX:XX:XX:XX:XX:XX
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     ens1f0
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/2
WIRED-PROPERTIES.CARRIER:               on
IP4.ADDRESS[1]:                         192.168.0.1
IP4.GATEWAY:                            192.168.0.254
IP4.ROUTE[1]:                           dst = 0.0.0.0/0, nh = 192.168.0.254, mt = 101
IP4.ROUTE[2]:                           dst = 198.168.0.0/19, nh = 0.0.0.0, mt = 101
IP6.ADDRESS[1]:                         xxxx::xxxx:xxxx:xxxx:xxxx/64
IP6.GATEWAY:                            --
IP6.ROUTE[1]:                           dst = fe80::/64, nh = ::, mt = 1024

GENERAL.DEVICE:                         eno1f1
GENERAL.TYPE:                           ethernet
GENERAL.HWADDR:                         94:6D:AE:9B:76:1C
GENERAL.MTU:                            1500
GENERAL.STATE:                          100 (connected)
GENERAL.CONNECTION:                     eno1f1
GENERAL.CON-PATH:                       /org/freedesktop/NetworkManager/ActiveConnection/4
WIRED-PROPERTIES.CARRIER:               on
IP4.ADDRESS[1]:                         192.168.200.201/19
IP4.GATEWAY:                            --
IP4.ROUTE[1]:                           dst = 192.168.192.0/19, nh = 0.0.0.0, mt = 102
IP6.ADDRESS[1]:                         fe80::966d:aeff:fe9b:761c/64
IP6.GATEWAY:                            --
IP6.ROUTE[1]:                           dst = fe80::/64, nh = ::, mt = 1024

Point 5: Start and Stop Device

# nmcli con down ens1d1
# nmcli con up ens1d1

References:

  1. nmcli command in Linux with Examples

Installing RoCE using Mellanox (Nvidia) OFED package

Prerequisites:

Do read Basic Understanding RoCE and Infiniband

Step 1: Install Mellanox Package

First, install the Mellanox package, which you can download from https://developer.nvidia.com/networking/ethernet-software. You can install it either the traditional way or with Ansible (see Installing Mellanox OFED (mlnx_ofed) packages using Ansible)

Step 2: Load the Drivers

Load the two kernel modules that are needed for RDMA and RoCE exchanges by using the command

# modprobe rdma_cm ib_umad

Step 3: Verify the drivers are loaded

# ibv_devinfo
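
Besides ibv_devinfo, it can be useful to confirm that the modules from Step 2 really are loaded. A small sketch, assuming the usual lsmod layout with the module name in the first column; the helper name missing_modules is made up for illustration.

```shell
# missing_modules NAME...: read `lsmod` output on stdin and print each of the
# given module names that is NOT in the loaded list.
missing_modules() {
  loaded=$(awk '{ print $1 }')          # first column holds the module name
  for m in "$@"; do
    printf '%s\n' "$loaded" | grep -qxF "$m" || echo "$m"
  done
}

# Typical use (prints nothing when both modules are loaded):
#   lsmod | missing_modules rdma_cm ib_umad
```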

Step 4: Set the RoCE to version 2

Set the RoCE protocol version to v2 with the command below, where

  • -d is the device
  • -p is the port
  • -m is the RoCE version

[root@node1]# cma_roce_mode -d mlx5_0 -p 1 -m 2
RoCE v2

Step 5: Check which RoCE devices are enabled on the Ethernet

[root@node-1]# ibdev2netdev
mlx5_0 port 1 ==> ens1f0 (Up)
mlx5_1 port 1 ==> ens1f1 (Down)
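
To pick out only the ports that are up, for example to decide which device to pass to cma_roce_mode, the output can be filtered. A minimal sketch, assuming the "<device> port <n> ==> <interface> (<state>)" layout shown above; the helper name up_ports is invented.

```shell
# up_ports: read `ibdev2netdev` output on stdin and print
# "<rdma device> <network interface>" for every port reported as Up.
up_ports() {
  awk '$NF == "(Up)" { print $1, $(NF-1) }'
}

# Typical use:
#   ibdev2netdev | up_ports
```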

References:

  1. Setting up a RoCE cluster

Basic Understanding RoCE and Infiniband

Prerequisites:

  1. RoCE requires a compliant Ethernet adapter. Currently, I am using Mellanox ConnectX-6 cards.
  2. RoCE requires a compliant switch. I used a Mellanox 100G switch.

The difference between traditional Ethernet communication and RoCE is explained very clearly in the diagram from Huawei’s Basic Knowledge and Differences of RoCE, IB, and TCP Networks

Some Key Pointers on the difference between TCP/IP and RDMA

  1. Traditional TCP/IP network communication sends messages through the kernel, which incurs high data movement and data copy overhead.
  2. RDMA can bypass the kernel and access memory directly, which allows low-latency network communication.

The 3 types of RDMA network technologies (InfiniBand, RoCE and iWARP) are neatly presented in Basic Knowledge and Differences of RoCE, IB, and TCP Networks

References:

  1. Basic Knowledge and Differences of RoCE, IB, and TCP Networks

Setting up 2 Gateways with a Default Gateway for most Traffic and the 2nd Gateway for selected Subnet Traffic on Rocky Linux 8

Issues:

Suppose you have 2 network cards, each with its own gateway. The challenge is that you can only have 1 default gateway. How do we work this out?

Solution:

Type the following command

$ ip route show
default via 192.168.1.254 dev eno0 proto static metric 104
192.168.2.0/24 via 192.168.2.254 dev eno1 proto static metric 103
10.10.1.0/24 via 192.168.2.254 dev eno1 proto static metric 103

That means the default route for traffic is via eno0 (gateway 192.168.1.254). All traffic uses this default gateway except traffic for 192.168.2.0/24 and 10.10.1.0/24, which passes through the second gateway (192.168.2.254) on eno1. How do we do it?

Set Default Route for all traffic

To set all traffic through the default gateway, do the following

$ ip route add default via 192.168.1.254 dev eno0 proto static metric 104

Set Selected IP Subnet for 2nd Gateway

$ ip route add 192.168.2.0/24 via 192.168.2.254 dev eno1 proto static metric 103
$ ip route add 10.10.1.0/24 via 192.168.2.254 dev eno1 proto static metric 103
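
Routes added with ip route add do not survive a reboot. One way to keep them on a NetworkManager-managed host is to store them in the connection profile with nmcli. The sketch below assumes a connection profile named eno1 exists; the helper name add_persistent_route and the NMCLI override variable are illustrative only (the override lets you dry-run with NMCLI=echo).

```shell
# add_persistent_route CONNECTION SUBNET GATEWAY METRIC
# Appends a static route to a NetworkManager connection profile.
# Set NMCLI=echo to print the nmcli arguments instead of running nmcli.
add_persistent_route() {
  "${NMCLI:-nmcli}" connection modify "$1" +ipv4.routes "$2 $3 $4"
}

# Typical use (values taken from the routes above):
#   add_persistent_route eno1 192.168.2.0/24 192.168.2.254 103
#   add_persistent_route eno1 10.10.1.0/24 192.168.2.254 103
#   nmcli connection up eno1     # re-activate the profile to apply
```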

Setting the DNS Correctly for each Network Card

If each of the network cards requires different DNS servers, do make sure you set them in the card's file under /etc/sysconfig/network-scripts

$ vim /etc/sysconfig/network-scripts/ifcfg-eno0
....
....
DEVICE=eno0
ONBOOT=yes
IPADDR=192.168.1.1
GATEWAY=192.168.1.254
DNS1=192.168.1.252
DNS2=192.168.1.253
NETMASK=255.255.255.0
$ vim /etc/sysconfig/network-scripts/ifcfg-eno1
....
....
DEVICE=eno1
ONBOOT=yes
IPADDR=192.168.2.1
GATEWAY=192.168.2.254
DNS1=192.168.2.252
DNS2=192.168.2.253
NETMASK=255.255.255.0

Deleting Route from Table

ip route delete 192.168.2.0/24 via 192.168.2.254 dev eno1 proto static metric 103

Different DNS Servers and Different Domains (For RHEL 8)

You can configure the dnsmasq service and NetworkManager to send DNS queries for a specific domain to a selected DNS server. The information can be found in Chapter 38. Using different DNS servers for different domains

By default, Red Hat Enterprise Linux (RHEL) sends all DNS requests to the first DNS server specified in the /etc/resolv.conf file. If this server does not reply, RHEL uses the next server in this file.

In environments where one DNS server cannot resolve all domains, administrators can configure RHEL to send DNS requests for a specific domain to a selected DNS server. For example, you can configure one DNS server to resolve queries for example.com and another DNS server to resolve queries for example.net. For all other DNS requests, RHEL uses the DNS server configured in the connection with the default gateway.

Procedure 1: Install dnsmasq package

# dnf install dnsmasq

Procedure 2: Edit the /etc/NetworkManager/NetworkManager.conf file, and set the following entry in the [main] section:

dns=dnsmasq

Procedure 3: Reload the NetworkManager service:

# systemctl reload NetworkManager

Procedure 4: Verify that the nameserver entry in the /etc/resolv.conf file refers to 127.0.0.1:

# cat /etc/resolv.conf
nameserver 127.0.0.1
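
The procedures here switch NetworkManager to dnsmasq but do not show where a domain is tied to a particular DNS server. One option, sketched here under assumptions, is a drop-in file in /etc/NetworkManager/dnsmasq.d/ that uses dnsmasq's server=/domain/address syntax. The helper name dnsmasq_rule, the file name split-dns.conf, the domain example.com and the address 192.168.2.252 are all placeholders.

```shell
# dnsmasq_rule DOMAIN SERVER: print a dnsmasq line that forwards queries for
# DOMAIN (and its subdomains) to the DNS server at SERVER.
dnsmasq_rule() {
  printf 'server=/%s/%s\n' "$1" "$2"
}

# Typical use (placeholder domain and server):
#   dnsmasq_rule example.com 192.168.2.252 | sudo tee /etc/NetworkManager/dnsmasq.d/split-dns.conf
#   sudo systemctl restart NetworkManager
```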

Procedure 5a: Verify using TCPDump Packet Sniffer

# dnf install tcpdump

Procedure 5b: On one terminal, start tcpdump to capture DNS traffic on all interfaces:

# tcpdump -i any port 53

Procedure 5c: On a different terminal, resolve host names for a domain for which an exception exists and another domain, for example:

# host -t A www.redhat.com
# host -t A www.MyInternalDomain.com

Verify in the tcpdump output that Rocky Linux sends DNS queries for www.redhat.com only to the designated DNS server and through the corresponding interface, and likewise for the internal domain

References:

  1. Chapter 38. Using different DNS servers for different domains
  2. Two Default Gateways on One System
  3. Linux Set up Routing with IP Command

Understanding the Difference between QSFP, QSFP+, QSFP28

Sometimes I use these terms loosely. Here is an article that explains the 3 fiber optic transceivers QSFP, QSFP+ and QSFP28

Taken from the article “Difference between QSFP, QSFP+, QSFP28”

Here are some main points

  1. The QSFP specification supports Ethernet, Fibre Channel, InfiniBand and SONET/SDH standards with different data rate options.
  2. QSFP transceivers support the network link over singlemode or multimode fiber patch cable.
  3. Common ones are 4x10G QSFP+, 4x28G QSFP28
  4. QSFP+ is designed to support 40G Ethernet, Serial Attached SCSI, QDR (40G) and FDR (56G) InfiniBand, and other communication standards
  5. QSFP+ modules integrate 4 transmit and 4 receive channels plus sideband signals, so QSFP+ modules can break out into 4x10G lanes.
  6. QSFP28 is a hot-pluggable transceiver module designed for 100G data rate.
  7. QSFP28 integrates 4 transmit and 4 receive channels. “28” means each lane carries up to 28G data rate.
  8. QSFP28 can do 4x25G breakout connection, 2x50G breakout, or 1x100G depending on the transceiver used.
  9. Usually QSFP28 modules can’t break out into 10G links. Inserting a QSFP28 module into a QSFP+ port is a different case, which depends on whether the switch supports it.
  10. QSFP+ and QSFP28 modules can support both short and long-haul transmission.

Using firewall-cmd to configure gateways and isolated client network on CentOS-7 and Rocky Linux 8

Objectives:

Compute Nodes in an HPC environment are usually physically isolated from the public network. To access the internet, or the company LAN for services such as LDAP, they have to route through a gateway, which in small or small-to-medium clusters is often the Head Node or another delegated node. You can use firewall-cmd to route this traffic through the gateway's internet-facing interface.

Scenario:

Traffic will be routed from the Head Node’s eno1 (private network) through the Head Node’s eno2 (internet facing). The interconnect eno1 is attached to a switch where the compute nodes are similarly attached. Some details:

  1. 192.168.1.0/24 is the private network subnet.
  2. 192.168.1.1 is the IP Address of the Head Node on the private network
  3. 155.1.1.2 is the IP Address of the external-facing ethernet, i.e. eno2

Check the zones.

# firewall-cmd --list-all-zones

Check the Active Zones

# firewall-cmd --get-active-zones
external
  interfaces: eno2
internal
  interfaces: eno1

Enable masquerade at the Head Node’s External Zone

IP masquerading is a process where one computer acts as an IP gateway for a network. For masquerading, the gateway dynamically looks up the IP of the outgoing interface all the time and replaces the source address in the packets with this address.

You use masquerading if the IP of the outgoing interface can change. A typical use case for masquerading is if a router replaces the private IP addresses, which are not routed on the internet, with the public dynamic IP address of the outgoing interface on the router.

For more information. Do take a look at 5.10. Configuring IP Address Masquerading

# firewall-cmd --zone=external --query-masquerade 
no
# firewall-cmd --zone=external --add-masquerade --permanent
# firewall-cmd --reload
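
The query, add and reload steps above can be wrapped so that they are safe to re-run, for example from a provisioning script. This is a sketch only: the helper name ensure_masquerade and the FIREWALL_CMD override variable are made up for illustration, and it relies on --query-masquerade printing yes or no as shown above.

```shell
# ensure_masquerade ZONE: enable masquerading on ZONE only if it is not
# already enabled. FIREWALL_CMD can point at a stand-in command for testing.
ensure_masquerade() {
  zone=$1
  fw=${FIREWALL_CMD:-firewall-cmd}
  # --query-masquerade prints "yes" or "no" for the zone
  if [ "$("$fw" --zone="$zone" --query-masquerade)" = "yes" ]; then
    echo "masquerade already enabled on $zone"
    return 0
  fi
  # Store the setting permanently, then reload to apply it
  "$fw" --permanent --zone="$zone" --add-masquerade && "$fw" --reload
}

# Typical use on the Head Node:
#   ensure_masquerade external
```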

Compute Nodes at the Private Network 

(Assuming that eno1 is connected to the private switch.) It is very important that you set the gateway in the compute node’s /etc/sysconfig/network-scripts/ifcfg-eno1

.....
.....
DEVICE=enp47s0f1
ONBOOT=yes
IPADDR=192.168.1.2 #Internal IP Address of the Compute Node
NETMASK=255.255.255.0
GATEWAY=192.168.1.1 #Internal IP Address of the Head Node

Next, you have to put the Network Interface of the Client in the Internal Zone of the firewall-cmd. Assuming that eno1 is also used by the Client Network

# firewall-cmd --zone=internal --change-interface=eno1 --permanent

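You may want to set SELinux to permissive mode. Note that setenforce 0 does not disable SELinux, and the change lasts only until the next reboot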

# setenforce 0

Configure the Head Node’s External Zone.

For Zoning, do take a look at 5.7.8. Using Zone Targets to Set Default Behavior for Incoming Traffic

For this setting, we have chosen target “default”

# firewall-cmd --permanent --zone=external --set-target=default

You can configure other settings for the External Zone. For example, add the SSH and mDNS services

# firewall-cmd --permanent --zone=external --add-service=ssh
# firewall-cmd --permanent --zone=external --add-service=mdns
# firewall-cmd --runtime-to-permanent
# firewall-cmd --reload

Make sure the right Ethernet card is placed in the right zone. For the external-facing Ethernet card (eno2), you may want to place it in the external zone

# firewall-cmd --zone=external --change-interface=eno2 --permanent

For the internal-facing Ethernet card (eno1), you may want to place it in the internal zone

# firewall-cmd --zone=internal --change-interface=eno1 --permanent

Configure the firewall-Source of Internal Network (eno1)

# firewall-cmd --zone=internal --add-source=192.168.1.0/24

Checking the Settings of the Active Zones with “firewall-cmd --list-all-zones”

# firewall-cmd --list-all-zones
internal (active)
  target: default
  icmp-block-inversion: no
  interfaces: eno1
  sources: 192.168.1.0/32
  services: dhcpv6-client mdns ssh
  ports:
  protocols:
  forward: no
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eno2
  sources:
  services: dhcpv6-client ssh
  ports: 
  protocols:
  forward: no
  masquerade: yes
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Check the Firewall Status

systemctl status firewalld.service

Encountering shm_open permission denied issues with hpcx

If you are using Nvidia HPC-X and encounter errors like the one below during your MPI run

shm_open(file_name=/ucx_shm_posix_77de2cf3 flags=0xc2) failed: Permission denied

The error message indicates that the process has no permission to use shared memory. The permission of /dev/shm was found to be 755, not 777, causing the error. The issue is resolved once the permission is changed to 777. To change and verify the permission:

% chmod 777 /dev/shm 
% ls -ld /dev/shm
drwxrwxrwx 2 root root 40 Jul  6 15:18 /dev/sh
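
To catch this on other nodes before an MPI run fails, the permission can be checked in a script. A minimal sketch, assuming GNU stat from coreutils (as shipped with Rocky Linux); the helper name world_writable is invented for this example.

```shell
# world_writable DIR: succeed (exit 0) if "others" have write permission on
# DIR, fail with 1 if not, or with 2 if stat cannot read DIR.
world_writable() {
  mode=$(stat -c '%a' "$1") || return 2   # octal mode, e.g. 755 or 777
  case "$mode" in
    *[2367]) return 0 ;;   # last octal digit includes the write bit
    *)       return 1 ;;
  esac
}

# Typical use:
#   world_writable /dev/shm || echo "/dev/shm is not world-writable"
```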