IPoIB working modes

The IPoIB driver supports two modes of operation: Unreliable Datagram (UD) and Connected Mode.

In Unreliable Datagram mode, the IB UD (Unreliable Datagram) transport is used, so the interface MTU is equal to the IB L2 MTU minus the IPoIB encapsulation header (4 bytes). In QDR, the default MTU value is 2044. In FDR onwards, the default for Unreliable Datagram is based on a 4096-byte IB MTU, which leaves 4092 bytes after the 4-byte header is subtracted.

In Connected Mode, the IB RC (Reliable Connected) transport is used. Connected mode takes advantage of the connected nature of the IB transport and allows an MTU up to the maximal IP packet size of 64K, which reduces the number of IP packets needed for handling large UDP datagrams, TCP segments, etc., and increases the performance for large messages. The default MTU will be 65000.

To verify which mode you are working in, run:

# cat /sys/class/net/ib0/mode
Datagram
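
The datagram-mode MTU arithmetic described above can be sketched as a small shell function. This is only an illustration: ipoib_ud_mtu is a hypothetical helper name of mine, and it assumes nothing beyond the 4-byte encapsulation header mentioned in the text.

```shell
# Hypothetical helper: expected IPoIB interface MTU in datagram (UD) mode.
# $1 is the IB link MTU in bytes, e.g. 2048 or 4096.
ipoib_ud_mtu() {
    # 4 = size of the IPoIB encapsulation header in bytes
    echo $(( $1 - 4 ))
}

ipoib_ud_mtu 2048   # prints 2044
ipoib_ud_mtu 4096   # prints 4092
```

You can compare the result with what the interface actually reports, for example with cat /sys/class/net/ib0/mtu.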

 

Using ibdiagnet to generate the topology of the network

You can use ibdiagnet to generate the topology of the IB network by passing the “-w” switch:

# ibdiagnet -w /var/tmp/ibdiagnet2/topology.top
.....
.....
-I- ibdiagnet database file   : /var/tmp/ibdiagnet2/ibdiagnet2.db_csv
-I- LST file                  : /var/tmp/ibdiagnet2/ibdiagnet2.lst
-I- Topology file             : /var/tmp/ibdiagnet2/topology.top
-I- Subnet Manager file       : /var/tmp/ibdiagnet2/ibdiagnet2.sm
-I- Ports Counters file       : /var/tmp/ibdiagnet2/ibdiagnet2.pm
-I- Nodes Information file    : /var/tmp/ibdiagnet2/ibdiagnet2.nodes_info
-I- Partition keys file       : /var/tmp/ibdiagnet2/ibdiagnet2.pkey
-I- Alias guids file          : /var/tmp/ibdiagnet2/ibdiagnet2.aguid
# vim /var/tmp/ibdiagnet2/topology.top
# This topology file was automatically generated by IBDM

SX6036G Left-Leaf-SW03
U1/P1 -4x-14G-> HCA_1 mtlacad05 U1/P1
U1/P17 -4x-14G-> SX6012 Right-Spine-SW02 U1/P2
U1/P18 -4x-14G-> SX6012 Left-Spine-SW01 U1/P2
U1/P2 -4x-14G-> HCA_1 mtlacad07 U1/P1
U1/P3 -4x-14G-> HCA_1 mtlacad03 U1/P1
U1/P4 -4x-14G-> HCA_1 mtlacad04 U1/P1
U1/P6 -4x-14G-> HCA_1 mtlacad06 U1/P1
.....
.....
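
If you want a quick one-line-per-link view of the topology file, something like the sketch below may help. It assumes the file layout is exactly what the sample above shows (a two-field device line followed by five-field link lines); list_links is a hypothetical helper name, not part of ibdiagnet.

```shell
# Hypothetical helper: summarise the links in an ibdiagnet topology file.
# Reads the files given as arguments, or stdin when there are none.
list_links() {
    awk '
        # A two-field line (e.g. "SX6036G Left-Leaf-SW03") names the device
        # whose ports are listed on the lines that follow.
        NF == 2 && $1 !~ /^#/ { device = $2; next }
        # Link lines look like: U1/P1 -4x-14G-> HCA_1 mtlacad05 U1/P1
        NF == 5 && $2 ~ /^-.*->$/ {
            link = $2
            sub(/^-/, "", link)    # strip the leading "-"
            sub(/->$/, "", link)   # strip the trailing "->"
            printf "%s %s -> %s %s [%s]\n", device, $1, $4, $5, link
        }
    ' "$@"
}

# Usage (path taken from the ibdiagnet run above):
#   list_links /var/tmp/ibdiagnet2/topology.top
```

Each output line gives the local device and port, the peer device and port, and the link width and speed in square brackets.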

Querying the logical and physical port states of an IB Port

ibportstate

  • Enables the querying of the logical link and physical port states of an IB port.
  • Displays information such as LinkSpeed, LinkWidth and extended link speed.
  • Allows adjusting the link speed that is enabled on any IB port.
# ibportstate LID PortNumber
# Port info: Lid 15 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................15
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
LinkSpeedExtSupported:...........14.0625 Gbps
LinkSpeedExtEnabled:.............14.0625 Gbps
LinkSpeedExtActive:..............14.0625 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
# MLNX ext Port info: Lid 15 port 1
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x01
LinkSpeedEnabled:................0x01
LinkSpeedActive:.................0x00
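
To pull a single field out of that output in a script, a sketch like the following could be used. It assumes the "Name:....value" layout shown above; port_field is a hypothetical helper name, and when a name appears twice (LinkSpeedActive does) it returns the first occurrence.

```shell
# Hypothetical helper: print the value of one field from ibportstate output.
# $1 = field name (e.g. LinkState); the ibportstate output is read from stdin.
port_field() {
    awk -v key="$1" '
        /^#/ { next }                      # skip the "# Port info" headers
        {
            n = index($0, ":")
            if (n == 0) next
            name  = substr($0, 1, n - 1)
            value = substr($0, n + 1)
            sub(/^\.+/, "", value)         # drop the dotted padding
            if (name == key && !found) { print value; found = 1 }
        }
    '
}

# Usage (LID 15, port 1 as in the example above):
#   ibportstate 15 1 | port_field LinkState
```

This makes it easy to check, for instance, that LinkState is Active and PhysLinkState is LinkUp before running a benchmark.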

Tools for Performance Test for IB

If you are using Mellanox IB switches, you can use the following tools to conduct performance tests:

Latency Server Side:

  1. ib_write_lat
  2. ib_read_lat
  3. ib_send_lat

Latency Client Side:

  1. ib_write_lat IP_Addresses
  2. ib_read_lat IP_Addresses
  3. ib_send_lat IP_Addresses

For example:

1a. Latency Server Side

# ib_read_lat

1b. Client Side

# ib_read_lat IP_Address_of_Server -F -a
#bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000          1.66           12.98        1.70
 4       1000          1.64           13.40        1.67
 8       1000          1.64           20.25        1.67
 16      1000          1.64           19.61        1.68
.....
.....
 4096    1000          2.94           18.45        2.99

2a. Bandwidth Server Side

# ib_read_bw

2b. Bandwidth Client Side

# ib_read_bw IP_Address_of_Server -F -a
#bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000             6.97               6.47               3.394435
....
....
8192       1000             5983.30            5982.07            0.765704
....
....
65536      1000             6075.37            6042.28            0.096676
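
As a sanity check on the bandwidth table, the MsgRate column appears to follow from the average bandwidth and the message size. The sketch below assumes that 1 MB here means 1048576 bytes, which is my inference from the sample rows rather than something the tool output states; msg_rate_mpps is a hypothetical helper name.

```shell
# Hypothetical helper: message rate in Mpps from message size and bandwidth.
# $1 = message size in bytes, $2 = average bandwidth in MB/sec.
# Assumption: 1 MB = 1048576 bytes, inferred from the sample rows above.
msg_rate_mpps() {
    LC_ALL=C awk -v size="$1" -v bw="$2" \
        'BEGIN { printf "%.6f\n", bw * 1048576 / size / 1000000 }'
}

msg_rate_mpps 8192 5982.07    # prints 0.765705 (table shows 0.765704)
msg_rate_mpps 65536 6042.28   # prints 0.096676 (table shows 0.096676)
```

The small difference in the first row is most likely because the table prints the bandwidth rounded to two decimals.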

 

 

Understanding ibtracert command

Suppose you have a source and a destination server with an IB switch in between.

Server 1 is shown below. I have changed the GUIDs and capability masks for confidentiality's sake.

[root@mtlacad03 ~]# ibstat
CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.33.5100
Hardware version: 0
Node GUID: 00000000000000
System image GUID: 00000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 15
LMC: 0
SM lid: 5
Capability mask: 0000000000000
Port GUID: 00000000000
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 111111111111
Port GUID: 1111111111111
Link layer: Ethernet

Server 2 is

[root@mtlacad07 ~]# ibstat
CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.33.5100
Hardware version: 0
Node GUID: 000000000000
System image GUID: 0000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 13
LMC: 0
SM lid: 5
Capability mask: 00000000000
Port GUID: 000000000000000
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 1111111111111
Port GUID: 1111111111111
Link layer: Ethernet

So if you run ibtracert from Server 2 (LID 13) to Server 1 (LID 15):

[root@mtlacad07 ~]# ibtracert 13 15
From ca {0000000000000} portnum 1 lid 13-13 "mtlacad07 HCA-1"
[1] -> switch port {0000000000000}[2] lid 20-20 "MF0;Left-Leaf-SW03:SX6036G/U1"
[3] -> ca port {00000000000}[1] lid 15-15 "mtlacad03 HCA-1"
To ca {0000000000000} portnum 1 lid 15-15 "mtlacad03 HCA-1"

Basically, what this means is:

  • Traffic is leaving mtlacad07 HCA-1 Port [1]
  • Traffic is entering Switch Port 2 at the Left-Leaf Switch
  • Traffic is leaving Switch Port 3 at the Left-Leaf Switch
  • Traffic is entering mtlacad03 HCA-1 Port [1]
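
The same reading of the hop lines can be done mechanically. The sketch below assumes the hop-line layout shown in the sample trace ("[out] -> type port {GUID}[in] lid ... "name""); describe_hops is a hypothetical helper name, not an option of ibtracert.

```shell
# Hypothetical helper: turn ibtracert hop lines into plain sentences.
# The ibtracert output is read from stdin.
describe_hops() {
    awk '
        /^\[/ {
            # Egress port of the previous node: the number in the leading [N].
            out_port = $1
            sub(/^\[/, "", out_port)
            sub(/\]$/, "", out_port)
            # Ingress port of the next node: the [M] that follows the {GUID}.
            in_port = $5
            sub(/^.*\[/, "", in_port)
            sub(/\].*$/, "", in_port)
            # Node description: the text between the double quotes.
            split($0, q, "\"")
            printf "leaves port %s, enters %s port %s of %s\n", out_port, $3, in_port, q[2]
        }
    '
}

# Usage (LIDs from the example above):
#   ibtracert 13 15 | describe_hops
```

For the trace above this yields one line per hop: out of port 1 into switch port 2, then out of switch port 3 into port 1 of the destination HCA.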

Only able to ping IPoIB on selected nodes after adding new nodes

When I added new nodes, I installed the MLNX_OFED drivers from Mellanox. Strangely, I was only able to ping some of the existing or new nodes on the cluster, seemingly at random. This was quite a curious problem.

Running ibstat did not point to the problem, but when I ran an ibping test (see Installing Voltaire QDR Infiniband Drivers for CentOS 5.4), the test failed for selected nodes in the cluster, while others were able to ping back.

Yes, both the openibd and opensmd services were started on all nodes in the cluster.

After some troubleshooting, the only fix that worked was to stop the opensmd service on all the nodes (existing and new) and start it again:

# service openibd restart
# service opensmd restart

40Gb Ethernet – A Competitive Alternative to InfiniBand (White paper)

Chelsio

This paper supports this conclusion with three real application benchmarks running on IBM’s RackSwitch G8316, a 40Gb Ethernet aggregation switch, in conjunction with Chelsio Communications’ 40Gb Ethernet Unified Wire network adapter. It shows how iWARP at 40Gbps offers application-level performance comparable to the latest InfiniBand FDR speeds.

40Gb Ethernet: A Competitive Alternative to InfiniBand – LAMMPS, WRF and Quantum ESPRESSO Modeling with 40Gb iWARP Technology (pdf)

Enabling SR-IOV on Intel Ethernet Server Adapter

First things first

Before you start: check that the Intel Ethernet Server Adapter supports SR-IOV. For more information, do take a look at Using SR-IOV with Intel® Ethernet Server Adapters

In a nutshell, you blacklist the VF driver on the host and enable the VFs as part of the KVM guests.

Step 1: Add the following line to /etc/modprobe.conf

options ixgbe max_vfs=8

The above configuration will create 8 virtual NICs (VFs) per port. The Intel card supports up to 64 VFs.

Step 2: Blacklist the ixgbevf driver by creating a file called /etc/modprobe.d/blacklist-ixgbevf.conf

blacklist ixgbevf

Step 3: Reboot the machine
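
After the reboot you may want to confirm that the VFs were actually created. One way, sketched below, is to count the virtfn* links that sysfs places in the physical function's PCI device directory. count_vfs is a hypothetical helper name, and eth2 in the usage comment is only a placeholder for your ixgbe interface.

```shell
# Hypothetical helper: count the virtual functions exposed by a physical
# function, given its sysfs device directory as $1.
# The body runs in a subshell so its variables do not leak into the caller.
count_vfs() (
    n=0
    for entry in "$1"/virtfn*; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        n=$((n + 1))
    done
    echo "$n"
)

# Usage (eth2 is a placeholder interface name):
#   count_vfs /sys/class/net/eth2/device
```

With max_vfs=8 as configured in Step 1, the expectation would be a count of 8 for each port.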