Fabric Debug Initiation using ibdiagnet (Part 1)

Some of these steps are taken from the Mellanox Academy Online Training.

Step 1: Clear all counters and begin the test execution

ibdiagnet -pc

Wait for a while; this usually takes 30 to 60 minutes.

Step 2: Check for errors that exceed the allowed threshold (a sketch for scanning the resulting report follows the option list below)

ibdiagnet -ls 25 -lw 4x -P all=1 --pm_pause_time 30
  • Specify the link speed
    -ls <2.5|5|10|14|25|50> 
  • Specify the Link width
    -lw <1x|4x|8x|12x>
  • Check the information provided by all counters and display each counter that crosses a threshold of 1
    -P all=1
  • The time between the two samples is set by the --pm_pause_time option
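
Once the sample completes, a minimal sketch of how you might scan the resulting report for threshold violations, assuming the default /var/tmp/ibdiagnet2 output directory and log file name:

# grep -i -E "error|exceed" /var/tmp/ibdiagnet2/ibdiagnet2.log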

How to find which switch the OpenSM is running on

1. On any client, do an ibstat

CA type: MT26428
        Number of ports: 2
        Firmware version: 2.9.1000
        Hardware version: b0
        Node GUID: 0x000xxxxxxxxxx
        System image GUID: 0xxxxxxxxxxxxxxxx
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 184
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510868
                Port GUID: 0x0002c903005abfd7
                Link layer: InfiniBand

2. To check which node is the Subnet Manager (SM), query the node description at the SM LID

# smpquery ND -L 1
Node Description:.......Voltaire 4036 # 4036-0D9E

where 1 is the SM LID reported by ibstat above (SM lid: 1)

3. Alternatively, query the InfiniBand SMInfo attribute

# sminfo
sminfo: sm lid 1 sm guid 0x8f10500200d9e, activity count 42554 priority 4 state 3 SMINFO_MASTER


IPoIB working modes

The IPoIB driver supports two modes of operation: Unreliable Datagram (UD) and Connected Mode.

In Unreliable Datagram mode, the IB UD (Unreliable Datagram) transport is used, so the interface MTU is equal to the IB L2 MTU minus the IPoIB encapsulation header (4 bytes). With QDR, the default MTU value is 2044 (2048 minus 4 bytes). From FDR onwards, the default Unreliable Datagram MTU is 4092 (4096 minus 4 bytes).

In Connected Mode, the IB RC (Reliable Connected) transport is used. Connected mode takes advantage of the connected nature of the IB transport and allows an MTU up to the maximal IP packet size of 64 KB, which reduces the number of IP packets needed to handle large UDP datagrams, TCP segments, etc., and improves performance for large messages. The default MTU in this mode is 65520.

To verify which mode you are working in, just do a

# cat /sys/class/net/ib0/mode
Datagram
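
To switch modes, you can write to the same sysfs file and raise the MTU to match. A minimal sketch, assuming the interface is ib0 and the driver was built with connected mode support:

# echo connected > /sys/class/net/ib0/mode
# ifconfig ib0 mtu 65520
# cat /sys/class/net/ib0/mode
connected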


Using ibdiagnet to generate the topology of the network

You can use ibdiagnet to generate the topology of the IB network simply by using the "-w" switch:

# ibdiagnet -w /var/tmp/ibdiagnet2/topology.top
.....
.....
-I- ibdiagnet database file   : /var/tmp/ibdiagnet2/ibdiagnet2.db_csv
-I- LST file                  : /var/tmp/ibdiagnet2/ibdiagnet2.lst
-I- Topology file             : /var/tmp/ibdiagnet2/topology.top
-I- Subnet Manager file       : /var/tmp/ibdiagnet2/ibdiagnet2.sm
-I- Ports Counters file       : /var/tmp/ibdiagnet2/ibdiagnet2.pm
-I- Nodes Information file    : /var/tmp/ibdiagnet2/ibdiagnet2.nodes_info
-I- Partition keys file       : /var/tmp/ibdiagnet2/ibdiagnet2.pkey
-I- Alias guids file          : /var/tmp/ibdiagnet2/ibdiagnet2.aguid
# vim /var/tmp/ibdiagnet2/topology.top
# This topology file was automatically generated by IBDM

SX6036G Left-Leaf-SW03
U1/P1 -4x-14G-> HCA_1 mtlacad05 U1/P1
U1/P17 -4x-14G-> SX6012 Right-Spine-SW02 U1/P2
U1/P18 -4x-14G-> SX6012 Left-Spine-SW01 U1/P2
U1/P2 -4x-14G-> HCA_1 mtlacad07 U1/P1
U1/P3 -4x-14G-> HCA_1 mtlacad03 U1/P1
U1/P4 -4x-14G-> HCA_1 mtlacad04 U1/P1
U1/P6 -4x-14G-> HCA_1 mtlacad06 U1/P1
.....
.....
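
The "-w" output is also handy as a baseline: take a topology snapshot periodically and diff it against an earlier one to spot cabling or port changes. A minimal sketch, with hypothetical dated file names:

# ibdiagnet -w /var/tmp/ibdiagnet2/topology.$(date +%Y%m%d).top
# diff /var/tmp/ibdiagnet2/topology.20160101.top /var/tmp/ibdiagnet2/topology.20160201.top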

Querying the logical and physical port states of an IB Port

ibportstate

  • Enables querying of the logical link and physical port states of an IB port
  • Displays information such as LinkSpeed, LinkWidth and extended link speed
  • Allows adjusting the link speed that is enabled on any IB port (see the sketch after the query output below)

The syntax is ibportstate <LID> <PortNumber>; for example, for LID 15, port 1:

# ibportstate 15 1
# Port info: Lid 15 port 1
LinkState:.......................Active
PhysLinkState:...................LinkUp
Lid:.............................15
SMLid:...........................1
LMC:.............................0
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
LinkSpeedExtSupported:...........14.0625 Gbps
LinkSpeedExtEnabled:.............14.0625 Gbps
LinkSpeedExtActive:..............14.0625 Gbps
Mkey:............................<not displayed>
MkeyLeasePeriod:.................0
ProtectBits:.....................0
# MLNX ext Port info: Lid 15 port 1
StateChangeEnable:...............0x00
LinkSpeedSupported:..............0x01
LinkSpeedEnabled:................0x01
LinkSpeedActive:.................0x00
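
ibportstate can also change what a port advertises. A minimal sketch (assuming LID 15, port 1) that restricts LinkSpeedEnabled to SDR and then resets the link so the new setting is renegotiated; the speed value is a bitmask (1 = 2.5 Gbps, 2 = 5 Gbps, 4 = 10 Gbps):

# ibportstate 15 1 speed 1
# ibportstate 15 1 reset
# ibportstate 15 1 query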

Tools for Performance Tests for IB

If you are using Mellanox IB switches, you can use the following tools to conduct performance tests:

Latency Server Side:

  1. ib_write_lat
  2. ib_read_lat
  3. ib_send_lat

Latency Client Side:

  1. ib_write_lat IP_Address_of_Server
  2. ib_read_lat IP_Address_of_Server
  3. ib_send_lat IP_Address_of_Server

For example:

1a. Latency Server Side

# ib_read_lat

1b. Client Side

# ib_read_lat IP_Address_of_Server -F -a
#bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
 2       1000          1.66           12.98        1.70
 4       1000          1.64           13.40        1.67
 8       1000          1.64           20.25        1.67
 16      1000          1.64           19.61        1.68
.....
.....
 4096    1000          2.94           18.45        2.99

2a. Bandwidth Server Side

# ib_read_bw

2b. Bandwidth Client Side

# ib_read_bw IP_Address_of_Server -F -a
#bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000             6.97               6.47               3.394435
....
....
8192       1000             5983.30            5982.07            0.765704
....
....
65536      1000             6075.37            6042.28            0.096676
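
If a host has more than one HCA or port, the perftest tools let you pin the test to a specific device and port with the -d and -i flags. A minimal sketch, assuming the device is named mlx4_0:

Server side:

# ib_read_bw -d mlx4_0 -i 1

Client side:

# ib_read_bw -d mlx4_0 -i 1 -F -a IP_Address_of_Server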


Understanding ibtracert command

Suppose you have a source and a destination server with an IB switch in between.

Server 1 is shown below. (I have changed the GUIDs and masks for confidentiality's sake.)

[root@mtlacad03 ~]# ibstat
CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.33.5100
Hardware version: 0
Node GUID: 00000000000000
System image GUID: 00000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 15
LMC: 0
SM lid: 5
Capability mask: 0000000000000
Port GUID: 00000000000
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 111111111111
Port GUID: 1111111111111
Link layer: Ethernet

Server 2 is shown below.

[root@mtlacad07 ~]# ibstat
CA 'mlx4_0'
CA type: MT4103
Number of ports: 2
Firmware version: 2.33.5100
Hardware version: 0
Node GUID: 000000000000
System image GUID: 0000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 56
Base lid: 13
LMC: 0
SM lid: 5
Capability mask: 00000000000
Port GUID: 000000000000000
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 1111111111111
Port GUID: 1111111111111
Link layer: Ethernet

So if you do an ibtracert from LID 13 to LID 15:

[root@mtlacad07 ~]# ibtracert 13 15
From ca {0000000000000} portnum 1 lid 13-13 "mtlacad07 HCA-1"
[1] -> switch port {0000000000000}[2] lid 20-20 "MF0;Left-Leaf-SW03:SX6036G/U1"
[3] -> ca port {00000000000}[1] lid 15-15 "mtlacad03 HCA-1"
To ca {0000000000000} portnum 1 lid 15-15 "mtlacad03 HCA-1"

Basically, what this means is that:

  • Traffic is leaving mtlacad07 HCA-1 port [1]
  • Traffic is entering switch port 2 at the Left-Leaf switch
  • Traffic is leaving switch port 3 at the Left-Leaf switch
  • Traffic is entering mtlacad03 HCA-1 port [1]
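
A quick way to collect the LIDs that ibtracert takes as arguments is to read the "Base lid" of the active port from ibstat on each host, e.g. (assuming the HCA is mlx4_0, as in the output above):

[root@mtlacad07 ~]# ibstat mlx4_0 1 | grep "Base lid"
Base lid: 13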

Only able to ping over IPoIB to selected existing nodes when adding new nodes

When I added new nodes and installed the MLNX_OFED drivers from Mellanox, I was strangely only able to ping selected existing or new nodes on the cluster. This was quite a curious problem.

The ibstat output looked fine, but an ibping test (see Installing Voltaire QDR Infiniband Drivers for CentOS 5.4) failed for selected nodes in the cluster, while others were able to ping back.

Yes, both the openibd and opensmd services were started on all nodes in the cluster.

After some troubleshooting, the only fix was to restart the openibd and opensmd services on all the nodes (existing and new):

# service openibd restart
# service opensmd restart
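
To apply this across the whole cluster in one go, a minimal sketch using ssh in a loop (node01 to node03 are hypothetical host names; note that restarting openibd briefly drops the IB interfaces):

# for h in node01 node02 node03; do ssh $h "service openibd restart; service opensmd restart"; done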

40Gb Ethernet – A Competitive Alternative to InfiniBand (White paper)

Chelsio

This paper supports the conclusion that 40Gb Ethernet is a competitive alternative to InfiniBand, using three real application benchmarks running on IBM's RackSwitch G8316, a 40Gb Ethernet aggregation switch, in conjunction with Chelsio Communications' 40Gb Ethernet Unified Wire network adapter. The paper shows how iWARP offers application-level performance at 40Gbps comparable to the latest InfiniBand FDR speeds.

40Gb Ethernet: A Competitive Alternative to InfiniBand – LAMMPS, WRF and Quantum ESPRESSO Modeling with 40Gb iWARP Technology (pdf)

PBS scripts for mpirun parameters for Chelsio / Infiniband Cards

If you are running Chelsio cards, you may want to specify the mpirun parameters to ensure that OpenMPI uses the appropriate transports and process bindings:

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun 
-mca btl openib,sm,self --bind-to-core 
--report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file

--bind-to-core: bind each MPI process to a core
-mca btl openib,sm,self: use the openib (InfiniBand), shared memory and loopback transports
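
A minimal sketch of a PBS job script wrapping the mpirun line above; the job name, resource request and the my_mpi_app executable are placeholders:

#!/bin/bash
#PBS -N ib_job
#PBS -l nodes=2:ppn=8
cd $PBS_O_WORKDIR
file=my_mpi_app
NCPUS=$(wc -l < $PBS_NODEFILE)
/usr/mpi/intel/openmpi-1.4.3/bin/mpirun \
    -mca btl openib,sm,self --bind-to-core \
    --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file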

For information on interprocess communication with shared memory, see Speaking UNIX: Interprocess communication with shared memory.