Setting up 2 Gateways with a Default Gateway for most Traffic and the 2nd Gateway for selected Subnet Traffic on Rocky Linux 8

Issues:

Suppose you have 2 network cards, each with its own gateway. The challenge is that you can only have 1 default gateway. How do we work this out?

Solution:

Type the following command

$ ip route show
default via 192.168.1.254 dev eno0 proto static metric 104
192.168.2.0/24 via 192.168.2.254 dev eno1 proto static metric 103
10.10.1.0/24 via 192.168.2.254 dev eno1 proto static metric 103

This means the default route is via 192.168.1.254 on eno0. All traffic except for 192.168.2.0/24 and 10.10.1.0/24, which go through the second gateway 192.168.2.254 on eno1, will use that default route. How do we set this up?

Set Default Route for all traffic

To set all traffic through the default gateway, do the following

$ ip route add default via 192.168.1.254 dev eno0 proto static metric 104
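
If a default route already exists, ip route add will refuse to add a second one. In that case, ip route replace with the same arguments swaps the default route in place (a small variant of the command above):

$ ip route replace default via 192.168.1.254 dev eno0 proto static metric 104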

Set Selected IP Subnet for 2nd Gateway

$ ip route add 192.168.2.0/24 via 192.168.2.254 dev eno1 proto static metric 103
$ ip route add 10.10.1.0/24 via 192.168.2.254 dev eno1 proto static metric 103
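
These ip route commands only change the running table and are lost on reboot. On Rocky Linux 8, one common way to persist the subnet routes is a route-<interface> file under /etc/sysconfig/network-scripts (a minimal sketch, assuming the ifcfg-rh style profiles shown below are in use; the addresses follow the example above):

$ vim /etc/sysconfig/network-scripts/route-eno1
192.168.2.0/24 via 192.168.2.254 dev eno1
10.10.1.0/24 via 192.168.2.254 dev eno1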

Setting the DNS Correctly for each Network Card

If each of the network cards requires a different DNS server, make sure you specify it in the corresponding ifcfg file under /etc/sysconfig/network-scripts

$ vim /etc/sysconfig/network-scripts/ifcfg-eno0
....
....
DEVICE=eno0
ONBOOT=yes
IPADDR=192.168.1.1
GATEWAY=192.168.1.254
DNS1=192.168.1.252
DNS2=192.168.1.253
NETMASK=255.255.255.0
$ vim /etc/sysconfig/network-scripts/ifcfg-eno1
....
....
DEVICE=eno1
ONBOOT=yes
IPADDR=192.168.2.1
GATEWAY=192.168.2.254
DNS1=192.168.2.252
DNS2=192.168.2.253
NETMASK=255.255.255.0
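
After editing the ifcfg files, the changes are not applied automatically. A typical way to reload and re-activate the profiles is shown below (a sketch, assuming NetworkManager manages these files and the connection names match the device names):

$ nmcli connection reload
$ nmcli connection up eno0
$ nmcli connection up eno1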

Deleting a Route from the Table

$ ip route delete 192.168.2.0/24 via 192.168.2.254 dev eno1 proto static metric 103

Different DNS Servers and Different Domains (For RHEL 8)

You can configure the systemd-resolved service and NetworkManager to send DNS queries for a specific domain to a selected DNS server. The information can be found in Chapter 38, "Using different DNS servers for different domains".

By default, Red Hat Enterprise Linux (RHEL) sends all DNS requests to the first DNS server specified in the /etc/resolv.conf file. If this server does not reply, RHEL uses the next server in this file.

In environments where one DNS server cannot resolve all domains, administrators can configure RHEL to send DNS requests for a specific domain to a selected DNS server. For example, you can configure one DNS server to resolve queries for example.com and another DNS server to resolve queries for example.net. For all other DNS requests, RHEL uses the DNS server configured in the connection with the default gateway.

Step 1: Start and enable the systemd-resolved service:

# systemctl --now enable systemd-resolved

Step 2: Edit the /etc/NetworkManager/NetworkManager.conf file, and set the following entry in the [main] section:

dns=systemd-resolved

Step 3: Reload the NetworkManager service:

# systemctl reload NetworkManager

Step 4: Verify that the nameserver entry in the /etc/resolv.conf file refers to 127.0.0.53:

# cat /etc/resolv.conf
nameserver 127.0.0.53

Verify that the systemd-resolved service listens on port 53 on the local IP address 127.0.0.53:

# ss -tulpn | grep "127.0.0.53"
udp  UNCONN 0  0      127.0.0.53%lo:53   0.0.0.0:*    users:(("systemd-resolve",pid=1050,fd=12))
tcp  LISTEN 0  4096   127.0.0.53%lo:53   0.0.0.0:*    users:(("systemd-resolve",pid=1050,fd=13))

Add the 2nd DNS server to /etc/resolv.conf

# Generated by NetworkManager
nameserver 127.0.0.53
nameserver 192.168.0.1
options edns0 trust-ad
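
If you want the 2nd DNS server to be used only for a specific domain, as described in Chapter 38, the routing is configured on the NetworkManager connection rather than in /etc/resolv.conf. A minimal sketch, assuming the connection is named eno1, the 2nd DNS server is 192.168.0.1, and example.net is the domain it should resolve (the ~ prefix marks a routing-only domain):

# nmcli connection modify eno1 ipv4.dns 192.168.0.1
# nmcli connection modify eno1 ipv4.dns-search ~example.net
# nmcli connection up eno1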
 

References:

  1. Chapter 38. Using different DNS servers for different domains
  2. Two Default Gateways on One System
  3. Linux Set up Routing with IP Command

Understanding the Difference between QSFP, QSFP+, QSFP28

Sometimes I use these terms loosely. Here is an article that explains the three fiber optic transceiver types QSFP, QSFP+ and QSFP28.

Taken from the article "Difference between QSFP, QSFP+, QSFP28".

Here are some main points

  1. The QSFP specification supports Ethernet, Fibre Channel, InfiniBand and SONET/SDH standards with different data rate options.
  2. QSFP transceivers support the network link over singlemode or multimode fiber patch cable.
  3. Common ones are 4x10G QSFP+, 4x28G QSFP28
  4. QSFP+ modules are designed to support 40G Ethernet, Serial Attached SCSI, QDR (40G) and FDR (56G) InfiniBand, and other communication standards
  5. QSFP+ modules integrate 4 transmit and 4 receive channels plus sideband signals, so they can break out into 4x10G lanes.
  6. QSFP28 is a hot-pluggable transceiver module designed for 100G data rate.
  7. QSFP28 integrates 4 transmit and 4 receive channels. "28" means each lane carries up to a 28G data rate.
  8. QSFP28 can do 4x25G breakout connection, 2x50G breakout, or 1x100G depending on the transceiver used.
  9. Usually QSFP28 modules can't break out into 10G links, but it is a different matter to insert a QSFP28 module into a QSFP+ port, if the switch supports it.
  10. QSFP+ and QSFP28 modules can support both short and long-haul transmission.

Using firewall-cmd to configure gateways and isolated client network on CentOS-7 and Rocky Linux 8

Objectives:

Compute nodes in an HPC environment are usually physically isolated from the public network and have to route through a gateway, which is often the Head Node or a delegated node in a small or small-to-medium sized cluster, to access the internet or the company LAN (for example, for LDAP). You can use firewall-cmd to route this traffic through the interface facing the internet.

Scenario:

Traffic from the Head Node's eno1 (private network) will be routed out through the Head Node's eno2 (internet facing). The internal interface eno1 is attached to a switch to which the compute nodes are also attached. Some assumptions:

  1. 192.168.1.0/24 is the private network subnet.
  2. 192.168.1.1 is the IP Address of the Head Node
  3. 155.1.1.2 is the IP Address of the external-facing ethernet, i.e. eno2

Check the zones.

# firewall-cmd --list-all-zones

Check the Active Zones

# firewall-cmd --get-active-zones
external
  interfaces: eno2
internal
  interfaces: eno1

Enable masquerade at the Head Node’s External Zone

IP masquerading is a process where one computer acts as an IP gateway for a network. For masquerading, the gateway dynamically looks up the IP of the outgoing interface all the time and replaces the source address in the packets with this address.

You use masquerading if the IP of the outgoing interface can change. A typical use case for masquerading is if a router replaces the private IP addresses, which are not routed on the internet, with the public dynamic IP address of the outgoing interface on the router.

For more information, do take a look at 5.10. Configuring IP Address Masquerading

# firewall-cmd --zone=external --query-masquerade 
no
# firewall-cmd --zone=external --add-masquerade --permanent
# firewall-cmd --reload
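
Adding masquerade should also enable IPv4 forwarding on the Head Node, which is required for routing the compute node traffic. A quick way to confirm (you would expect a value of 1):

# sysctl net.ipv4.ip_forward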

Compute Nodes on the Private Network

(Assuming that eno1 on the Head Node is connected to the private switch.) It is very important that you set the gateway in the compute node's /etc/sysconfig/network-scripts ifcfg file for its private-network interface, which may have a different name on the compute node (here, enp47s0f1):

.....
.....
DEVICE=enp47s0f1
ONBOOT=yes
IPADDR=192.168.1.2 #Internal IP Address of the Compute Node
NETMASK=255.255.255.0
GATEWAY=192.168.1.1 #Internal IP Address of the Head Node
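
Once the interface is up, a quick sanity check from a compute node is to confirm that its default route points at the Head Node and that external traffic gets through (a sketch; the interface name and addresses follow the example above, and the route line is what you would expect to see):

# ip route show | grep default
default via 192.168.1.1 dev enp47s0f1
# ping -c 3 8.8.8.8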

Next, you have to put the Head Node's network interface that faces the client network into the internal zone of firewalld. Assuming that eno1 faces the client network:

# firewall-cmd --zone=internal --change-interface=eno1 --permanent

You may want to set SELinux to permissive mode

# setenforce 0
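
Note that setenforce 0 only switches SELinux to permissive mode until the next reboot. To keep it permissive across reboots, you would also edit /etc/selinux/config (a sketch):

# vim /etc/selinux/config
SELINUX=permissive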

Configure the Head Node’s External Zone.

For Zoning, do take a look at 5.7.8. Using Zone Targets to Set Default Behavior for Incoming Traffic

For this setting, we have chosen target “default”

# firewall-cmd --zone=external --set-target=default

You can configure other settings for the External Zone, for example adding the SSH and mDNS services:

# firewall-cmd --permanent --zone=external --add-service=ssh
# firewall-cmd --permanent --zone=external --add-service=mdns
# firewall-cmd --runtime-to-permanent
# firewall-cmd --reload
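
You can confirm that the services are now part of the external zone with the following check; the output should include ssh and mdns:

# firewall-cmd --zone=external --list-services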

Make sure the right ethernet interface is placed in the right zone. The external-facing ethernet card (eno2) goes in the external zone:

# firewall-cmd --zone=external --change-interface=eno2 --permanent

The internal-facing ethernet card (eno1) goes in the internal zone:

# firewall-cmd --zone=internal --change-interface=eno1 --permanent

Configure the firewall source for the Internal Network (eno1)

# firewall-cmd --zone=internal --add-source=192.168.1.0/24
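
Note that without --permanent the source is only added to the runtime configuration. To keep it across reloads, you would also run the permanent variant of the same command:

# firewall-cmd --zone=internal --add-source=192.168.1.0/24 --permanent
# firewall-cmd --reload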

Checking the Settings with "firewall-cmd --list-all-zones" (output trimmed to the active zones)

# firewall-cmd --list-all-zones
internal (active)
  target: default
  icmp-block-inversion: no
  interfaces: eno1
  sources: 192.168.1.0/24
  services: dhcpv6-client mdns ssh
  ports:
  protocols:
  forward: no
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eno2
  sources:
  services: dhcpv6-client ssh
  ports: 
  protocols:
  forward: no
  masquerade: yes
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

Check the Firewall Status

# systemctl status firewalld.service

Encountering shm_open permission denied issues with hpcx

If you are using Nvidia HPC-X and encounter an error like the one below during your MPI run

shm_open(file_name=/ucx_shm_posix_77de2cf3 flags=0xc2) failed: Permission denied

The error message indicates that the process has no permission to use the shared memory. The permission of /dev/shm was found to be 755 instead of 777, which causes the error. The issue is resolved after the permission is changed to 777. To change it and verify the change:

% chmod 777 /dev/shm 
% ls -ld /dev/shm
drwxrwxrwx 2 root root 40 Jul  6 15:18 /dev/shm
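
A chmod on /dev/shm only lasts until the tmpfs is remounted, for example at reboot. If the wrong mode keeps coming back, one way to pin it is a mount option in /etc/fstab (a sketch, assuming /dev/shm is a standard tmpfs mount; 1777 is the usual distribution default and also satisfies the permission requirement above):

% vim /etc/fstab
tmpfs   /dev/shm   tmpfs   defaults,mode=1777   0 0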

EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches

Nvidia Corporation has announced the EOL Notice #LCR-000906 – MELLANOX

PCN INFORMATION:
PCN Number: LCR-000906 – MELLANOX
PCN Description: EOL notice for Mellanox ConnectX-5 VPI host channel adapters and Switch-IB 2 based EDR InfiniBand Switches
Publish Date: Sun May 08 00:00:00 GMT 2022
Type: FYI

A relook at InfiniBand and Ethernet Trends on Top500

I have put up an article on the Top500 interconnect trends from Nvidia's perspective. There is another article, put up by The Next Platform, that takes a closer look at the InfiniBand and Ethernet trends.

Taken from The Next Platform “The Eternal Battle Between Infiniband and Ethernet”

The penetration of Ethernet rises as the list fans out, as you might expect, with many academic and industry HPC systems not being able to afford InfiniBand or not willing to switch away from Ethernet. And as those service providers, cloud builders, and hyperscalers run Linpack on small portions of their clusters for whatever political or business reasons they have. Relatively slow Ethernet is popular in the lower half of the Top500 list, and while InfiniBand gets down there, its penetration drops from 70 percent in the Top10 to 34 percent in the complete Top500.

Nvidia’s InfiniBand has 34 percent share of Top500 interconnects, with 170 systems, but what has not been obvious is the rise of Mellanox Spectrum and Spectrum-2 Ethernet switches on the Top500, which accounted for 148 additional systems. That gives Nvidia a 63.6 percent share of all interconnects on the Top500 rankings. That is the kind of market share that Cisco Systems used to enjoy for two decades in the enterprise datacenter, and that is quite an accomplishment.

References:

The Eternal Battle Between Infiniband and Ethernet

UDP Tuning to maximise performance

There is an interesting article on how your UDP traffic can be tuned for maximum performance with a few tweaks. The article is taken from UDP Tuning

The most important factors, as mentioned in the article, are:

  • Use jumbo frames: performance will be 4-5 times better using 9K MTUs
  • packet size: best performance is MTU size minus packet header size. For example for a 9000Byte MTU, use 8972 for IPV4, and 8952 for IPV6.
  • socket buffer size: For UDP, buffer size is not related to RTT the way TCP is, but the defaults are still not large enough. Setting the socket buffer to 4M seems to help a lot in most cases (see the sketch after this list)
  • core selection: UDP at 10G is typically CPU limited, so it's important to pick the right core. This is particularly true on Sandy/Ivy Bridge motherboards.
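
For the socket buffer point above, raising the kernel limits is typically done with sysctl. Below is a minimal sketch, assuming a RHEL/Rocky-style host; the file name 90-udp-tuning.conf is illustrative, the values follow the 4M suggestion, and the interface name eno1 for the jumbo-frame MTU is hypothetical. Note that these sysctls only raise the allowed maximum; the application still has to request the larger buffer (for example via setsockopt with SO_RCVBUF/SO_SNDBUF).

# vim /etc/sysctl.d/90-udp-tuning.conf
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
# sysctl -p /etc/sysctl.d/90-udp-tuning.conf
# ip link set dev eno1 mtu 9000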

Do take a look at the article UDP Tuning

Performance Required for Deep Learning

There is a question I wanted to answer about deep learning: which system, network, and protocol features are essential to speed up training and/or inferencing? It may not be necessary to apply the same level of requirements to training as to inferencing, and vice versa. I received this information during an Nvidia presentation.

Training:

  1. Scalability requires ultra-fast networking
  2. Same hardware needs as HPC
  3. Extreme network bandwidth
  4. RDMA
  5. SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
  6. GPUDirect (https://developer.nvidia.com/gpudirect)
  7. Fast Access Storage

Inferencing

  1. Highly Transactional
  2. Ultra-low Latency
  3. Instant Network Response
  4. RDMA
  5. PeerDirect, GPUDirect