Diagnostic Tools for InfiniBand Fabric Information

There are a few diagnostic tools to diagnose the InfiniBand fabric. Use man to see the parameters for each of the following commands:

  1. ibnodes – (Show InfiniBand nodes in topology)
  2. ibhosts – (Show InfiniBand host nodes in topology)
  3. ibswitches – (Show InfiniBand switch nodes in topology)
  4. ibnetdiscover – (Discover InfiniBand topology)
  5. ibchecknet – (Validate IB subnet and report errors)
  6. ibdiagnet – (Scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices)
  7. perfquery – (Find errors on one or more HCAs and switch ports)

ibnodes (Show InfiniBand nodes in topology)

ibnodes is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the IB nodes (CAs and switches).

# ibnodes
.....
Ca      : 0x0000000000009b02 ports 2 "c00 HCA-1"
Ca      : 0x0000000000005af0 ports 1 "h00 HCA-1"
Switch  : 0x00000000000000fa ports 36 "IBM HSSM" enhanced port 0 lid 19 lmc 0
.....
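
If a topology file has already been saved with ibnetdiscover, ibnodes can be run against it instead of walking the fabric again; the same applies to ibhosts and ibswitches below. A minimal sketch, where topo.out is just a placeholder file name:

# ibnetdiscover > topo.out
# ibnodes topo.out
# ibhosts topo.out
# ibswitches topo.out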

ibhosts (Show InfiniBand host nodes in topology)

ibhosts is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the CA nodes.

# ibhosts
Ca      : 0x0000000000009b02 ports 2 "c00 HCA-1"
Ca      : 0x0000000000005af0 ports 1 "h00 HCA-1"

ibswitches (Show InfiniBand switch nodes in topology)

ibswitches is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the switch nodes.

# ibswitches
Switch  : 0x00000000000003fa ports 36 "IBM HSSM" enhanced port 0 lid 19 lmc 0
Switch  : 0x00000000000003cc ports 36 "IBM HSSM" enhanced port 0 lid 16 lmc 0

ibnetdiscover (Discover InfiniBand topology)

ibnetdiscover performs IB subnet discovery and outputs a human readable topology file. GUIDs, node types, and port numbers are displayed as well as port LIDs and NodeDescriptions. All nodes (and links) are displayed (full topology). Optionally, this utility can be used to list the current connected nodes by node type. The output is printed to standard output unless a topology file is specified.

# ibnetdiscover
#
# Topology file: generated on Mon Jan 28 14:19:57 2013
#
# Initiated from node 0000000000000080 port 0000090300451281

vendid=0x2c9
devid=0xc738
sysimgguid=0x2c90000000000
switchguid=0x2c90000000080(0000000000080)
Switch  36 "S-0002c9030071ba80"         # "MF0;switch-6260a0:SX90Y3245/U1" enhanced port 0 lid 2 lmc 0
[2]     "H-00000000000011e0"[1](00000000000e1)          # "node-c01 HCA-1" lid 3 4xQDR
[3]     "H-00000000000012d0"[1](00000000000d1)          # "node-c02 HCA-1" lid 4 4xQDR
....
....
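
As mentioned above, ibnetdiscover can also save the topology to a file and list only the connected nodes by node type. A brief sketch based on the usual infiniband-diags options (check man ibnetdiscover on your system; topo.out is just a placeholder file name):

# ibnetdiscover > topo.out     save the full topology to a file
# ibnetdiscover -l             list all connected nodes
# ibnetdiscover -H             list only the connected CAs
# ibnetdiscover -S             list only the connected switches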

ibchecknet (Validate IB subnet and report errors)

# ibchecknet
......
......
## Summary: 31 nodes checked, 0 bad nodes found
##          88 ports checked, 59 bad ports found
##          12 ports have errors beyond threshold

perfquery command

The perfquery command is useful for finding errors on one or more HCAs and switch ports. You can also use perfquery to reset HCA and switch port counters.

# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................13
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................199578830
PortRcvData:.....................504398997
PortXmitPkts:....................15649860
PortRcvPkts:.....................15645526
PortXmitWait:....................0
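
perfquery can also be pointed at a specific LID and port, and can reset the counters after reading them. A minimal sketch using the LID and port from the output above (see man perfquery for the exact options in your infiniband-diags version):

# perfquery 1 1          query the port counters of LID 1, port 1
# perfquery -R 1 1       query and then reset the counters of LID 1, port 1
# perfquery -a 19        aggregated counters across all ports of LID 19 (e.g. a switch)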

Diagnostic Tools for InfiniBand Devices

There are a few diagnostic tools to diagnose InfiniBand devices:

  1. ibv_devinfo (Query RDMA devices)
  2. ibstat (Query basic status of InfiniBand device(s))
  3. ibstatus (Query basic status of InfiniBand device(s))

ibv_devinfo (Query RDMA devices)

Print information about RDMA devices available for use from userspace.

# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.10.2322
        node_guid:                      0002:c903:0045:1280
        sys_image_guid:                 0002:c903:0045:1283
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       IBM0FD0140019
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             IB

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             IB
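
ibv_devinfo can also be limited to a single device or port, or asked for extended attributes. A minimal sketch reusing the device name from the output above (check man ibv_devinfo for your libibverbs version):

# ibv_devinfo -l               list the RDMA devices present on the host
# ibv_devinfo -d mlx4_0 -i 1   show only port 1 of device mlx4_0
# ibv_devinfo -d mlx4_0 -v     verbose output, including device capabilities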

ibstat (Query basic status of InfiniBand device(s))

ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.

It is similar to the ibstatus utility but implemented as a binary rather than a script. It has options to list CAs and/or ports and displays more information than ibstatus.

# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.2322
        Hardware version: 0
        Node GUID: 0x0002c90300451280
        System image GUID: 0x0002c90300451283
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0x0002c90300451281
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0x0002c90300451282
                Link layer: InfiniBand
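
As noted above, ibstat has options to list the local CAs and ports or to show a single CA or port. A minimal sketch using the CA name from the output above (see man ibstat):

# ibstat -l              list the names of the local CAs
# ibstat -p              list the local port GUIDs
# ibstat mlx4_0 1        show only port 1 of CA mlx4_0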

ibstatus (Query basic status of InfiniBand device(s))

ibstatus is a script which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.

# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0045:1281
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0045:1282
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand
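
ibstatus also accepts an optional device or device:port argument when you only want a single port. A minimal sketch using the device shown above:

# ibstatus mlx4_0        all ports of mlx4_0
# ibstatus mlx4_0:1      only port 1 of mlx4_0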

Sample PBS Scripts for R

Here is a sample PBS script that can be used for R. This is just a suggested script; modify and comment at will. The script below is named R.sh.

#!/bin/bash
#PBS -N R-job
#PBS -j oe
#PBS -V
#PBS -m bea
#PBS -M myemail@hotmail.com
#PBS -l nodes=1:ppn=8

# comment these out if you wish
echo "qsub host = " $PBS_O_HOST
echo "original queue = " $PBS_O_QUEUE
echo "qsub working directory absolute = " $PBS_O_WORKDIR
echo "pbs environment = " $PBS_ENVIRONMENT
echo "pbs batch = " $PBS_JOBID
echo "pbs job name from me = " $PBS_JOBNAME
echo "Name of file containing nodes = " $PBS_NODEFILE
echo "contents of nodefile = " $PBS_NODEFILE
echo "Name of queue to which job went = " $PBS_QUEUE

# Pre-processing script
cd $PBS_O_WORKDIR
NCPUS=`cat $PBS_NODEFILE | wc -l`
echo "Number of requested processors = " $NCPUS

# Load R Module
module load mpi/intel_1.4.3
module load intel/12.0.2
module load R/R-2.15.1

# ###############
# Execute Program
# ################
/usr/local/R-2.15.1/bin/R CMD BATCH $file

The corresponding qsub command and its parameters should be something like:

$ qsub -q dqueue -l nodes=1:ppn=8 -v file=Rjob.r R.sh
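
After submission, the job can be monitored with qstat. When it completes, R CMD BATCH leaves its console output in an .Rout file in the working directory, and the joined PBS stdout/stderr (#PBS -j oe) appears as R-job.o<jobid>. A brief sketch of the follow-up commands (the exact .Rout file name is derived from your input file):

$ qstat -u $USER
$ ls -l *.Rout R-job.o*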

Sample PBS Scripts for MATLAB

Here is a sample PBS script that can be used for MATLAB. This is just a suggested script; modify and comment at will. The script below is named matlab_serial.sh.

#!/bin/bash
#PBS -N MATLAB_Serial
#PBS -j oe
#PBS -V
#PBS -m bea
#PBS -M myemail@hotmail.com
#PBS -l nodes=1:ppn=1

# comment these out if you wish
echo "qsub host = " $PBS_O_HOST
echo "original queue = " $PBS_O_QUEUE
echo "qsub working directory absolute = " $PBS_O_WORKDIR
echo "pbs environment = " $PBS_ENVIRONMENT
echo "pbs batch = " $PBS_JOBID
echo "pbs job name from me = " $PBS_JOBNAME
echo "Name of file containing nodes = " $PBS_NODEFILE
echo "contents of nodefile = " $PBS_NODEFILE
echo "Name of queue to which job went = " $PBS_QUEUE

## pre-processing script
cd $PBS_O_WORKDIR
NCPUS=`cat $PBS_NODEFILE | wc -l`
echo "Number of requested processors = " $NCPUS

# Load MATLAB Module
module load intel/12.0.2
module load matlab/R2011b

cd $PBS_O_WORKDIR
/usr/local/MATLAB/R2011b/bin/matlab -nodisplay -r "run('$file'); exit"

The corresponding qsub command and its parameters should be something like:

$ qsub -q dqueue -l nodes=1:ppn=1 -v file=yourmatlabfile.m matlab_serial.sh

Configuring the Torque Default Queue

Here is a sample Torque queue configuration:

qmgr -c "create queue dqueue"
qmgr -c "set queue dqueue queue_type = Execution"
qmgr -c "set queue dqueue resources_default.neednodes = dqueue"
qmgr -c "set queue dqueue enabled = True"
qmgr -c "set queue dqueue started = True"

qmgr -c "set server scheduling = True"
qmgr -c "set server acl_hosts = headnode.com"
qmgr -c "set server default_queue = dqueue"
qmgr -c "set server log_events = 127"
qmgr -c "set server mail_from = Cluster_Admin"
qmgr -c "set server query_other_jobs = True"
qmgr -c "set server resources_default.walltime = 240:00:00"
qmgr -c "set server resources_max.walltime = 720:00:00"
qmgr -c "set server scheduler_iteration = 60"
qmgr -c "set server node_check_rate = 150"
qmgr -c "set server tcp_timeout = 6"
qmgr -c "set server node_pack = False"
qmgr -c "set server mom_job_sync = True"
qmgr -c "set server keep_completed = 300"
qmgr -c "set server submit_hosts = headnode1.com"
qmgr -c "set server submit_hosts += headnode2.com"
qmgr -c "set server allow_node_submit = True"
qmgr -c "set server auto_node_np = True"
qmgr -c "set server next_job_number = 21293"

Quick method for estimating walltime for Torque Resource Manager

For Torque/OpenPBS or any other scheduler, walltime is an important parameter that allows the scheduler to determine how long a job will take. You can do a quick, rough estimate by using the time command:

# time -p mpirun -np 16 --host node1,node2 hello_world_mpi
real 4.31
user 0.04
sys 0.01

The real time reported is about 4.31 seconds. Since this is only a rough estimate, you may want to place a higher value in the walltime, for example 5 minutes:

$ qsub -l walltime=5:00 -l nodes=1:ppn=8 -v file=hello_world openmpi.sh
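
If you prefer not to pass the walltime on the qsub command line each time, the same request can be embedded in the job script as a PBS directive, for example (reusing the 5-minute estimate from above):

#PBS -l walltime=00:05:00
#PBS -l nodes=1:ppn=8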

OFED Performance Micro-Benchmark Latency Test

The OpenFabrics Enterprise Distribution (OFED) provides a collection of simple performance micro-benchmarks written over uverbs. Some notes taken from the OFED Performance Tests README:

  1. The benchmark uses the CPU cycle counter to get time stamps without a context switch.
  2. The benchmark measures round-trip time but reports half of that as one-way latency. This means that it may not be sufficiently accurate for asymmetrical configurations.
  3. Min/Median/Max results are reported.
    The Median (vs average) is less sensitive to extreme scores.
    Typically, the Max value is the first value measured.
  4. Larger samples only help marginally. The default (1000) is very satisfactory. Note that an array of cycles_t (typically an unsigned long) is allocated once to collect samples and again to store the difference between them. Really big sample sizes (e.g., 1 million) might expose other problems with the program.

On the Server Side

# ib_write_lat -a

On the Client Side

# ib_write_lat -a Server_IP_address
------------------------------------------------------------------
                    RDMA_Write Latency Test
 Number of qps   : 1
 Connection type : RC
 Mtu             : 2048B
 Link type       : IB
 Max inline data : 400B
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
------------------------------------------------------------------
 local address: LID 0x01 QPN 0x02ce PSN 0x1bd93e RKey 0x014a00 VAddr 0x002b7004651000
 remote address: LID 0x03 QPN 0x00f2 PSN 0x20aec7 RKey 0x010100 VAddr 0x002aeedfbde000
------------------------------------------------------------------

#bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]
2       1000          0.92           5.19         1.24
4       1000          0.92           65.20        1.24
8       1000          0.90           72.28        1.23
16      1000          0.92           19.56        1.25
32      1000          0.94           17.74        1.26
64      1000          0.94           26.40        1.20
128     1000          1.05           53.24        1.36
256     1000          1.70           21.07        1.83
512     1000          2.13           11.61        2.22
1024    1000          2.44           8.72         2.52
2048    1000          2.79           48.23        3.09
4096    1000          3.49           52.59        3.63
8192    1000          4.58           64.90        4.69
16384   1000          6.63           42.26        6.76
32768   1000          10.80          31.11        10.91
65536   1000          19.14          35.82        19.23
131072  1000          35.56          62.17        35.84
262144  1000          68.95          80.15        69.10
524288  1000          135.34         195.46       135.62
1048576 1000          268.37         354.36       268.64
2097152 1000          534.34         632.83       534.67
4194304 1000          1066.41        1150.52      1066.71
8388608 1000          2130.80        2504.32      2131.39

Common options you can use:

Common Options to all tests:
-p, --port=<port>            listen on/connect to port <port> (default: 18515)
-m, --mtu=<mtu>              mtu size (default: 1024)
-d, --ib-dev=<dev>           use IB device <dev> (default: first device found)
-i, --ib-port=<port>         use port <port> of IB device (default: 1)
-s, --size=<size>            size of message to exchange (default: 1)
-a, --all                    run sizes from 2 till 2^23
-t, --tx-depth=<dep>         size of tx queue (default: 50)
-n, --iters=<iters>          number of exchanges (at least 100, default: 1000)
-C, --report-cycles          report times in cpu cycle units (default: microseconds)
-H, --report-histogram       print out all results (default: print summary only)
-U, --report-unsorted        (implies -H) print out unsorted results (default: sorted)
-V, --version                display version number
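
Instead of sweeping all message sizes with -a, the options above can be combined to test a single message size on a specific HCA and port. A hedged sketch (mlx4_0 is a placeholder device name; run the same command on the server first, then on the client with the server address):

# ib_write_lat -d mlx4_0 -i 1 -s 4096 -n 5000                     (server side)
# ib_write_lat -d mlx4_0 -i 1 -s 4096 -n 5000 Server_IP_address   (client side)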

Multiprotocol Performance Test of VMware ESX 3.5 on NetApp Storage Systems

NetApp has written a technical paper, “Performance Report: Multiprotocol Performance Test of VMware® ESX 3.5 on NetApp Storage Systems”, on performance testing using FCP, iSCSI, and NFS on VMware ESX 3.5. Do read the paper for the full details; only the summary is listed here.

Fibre Channel Protocol Summary

  1. FC achieved up to 9% higher throughput than the other protocols while requiring noticeably lower CPU utilization on the ESX 3.5 host compared to NFS and iSCSI.
  2. FC storage infrastructures are generally the most costly of all the protocols to install and maintain. FC infrastructure requires expensive Fibre Channel switches and Fibre Channel cabling in order to be deployed.

iSCSI Protocol Summary

  1. Using the VMware iSCSI software initiator, performance was observed to be at most 7% lower than FC.
  2. Software iSCSI also exhibited the highest maximum ESX 3.5 host CPU utilization of all the protocols tested.
  3. iSCSI is relatively inexpensive to deploy and maintain, as it runs on a standard TCP/IP network.

NFS Protocol Summary

  1. NFS performance was at most 9% lower than FC. NFS also exhibited ESX 3.5 host server CPU utilization that was on average higher than FC but lower than iSCSI.
  2. Running on a standard TCP/IP network, NFS does not require the expensive Fibre Channel switches, host bus adapters, and Fibre Channel cabling that FC requires, making NFS a lower-cost alternative to FC.
  3. NFS provides further storage efficiencies by allowing on-demand resizing of datastores and by increasing the storage savings gained when using deduplication. Both of these advantages provide additional operational savings as a result of this storage simplification.