Running Linpack (HPL) Test on Linux Cluster with OpenMPI and Intel Compilers

According to the HPL website,

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

The algorithm used by HPL can be summarized by the following keywords: Two-dimensional block-cyclic data distribution – Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths – Recursive panel factorization with pivot search and column broadcast combined – Various virtual panel broadcast topologies – bandwidth reducing swap-broadcast algorithm – backward substitution with look-ahead of depth 1.

1. Requirements:

  1. MPI (1.1 compliant). For this entry, I’m using OpenMPI
  2. BLAS and VSIPL

2. For installing BLAS, LAPACK and OpenMPI, do look at

  1. Building BLAS Library using Intel and GNU Compiler
  2. Building LAPACK 3.4 with Intel and GNU Compiler
  3. Building OpenMPI with Intel Compilers
  4. Compiling ATLAS on CentOS 5

3. Download the latest HPL (hpl-2.1.tar.gz) from http://www.netlib.org

4. Copy Make.Linux_PII_CBLAS file from  $(HOME)/hpl-2.1/setup/ to $(HOME)/hpl-2.1/

5. Edit Make.Linux_PII_CBLAS file

# vim ~/hpl-2.1/Make.Linux_PII_CBLAS
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_PII_CBLAS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = $(HOME)/hpl-2.1
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a

# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/local/mpi/intel
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /usr/local/atlas/lib
LAinc        =
LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
#
.....
.....
.....
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = /usr/local/mpi/intel/bin/mpicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = /usr/local/mpi/intel/bin/mpicc
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------

6. Compile the HPL

# make arch=Linux_PII_CBLAS

Running the LinPack on multiple Nodes

$ cd ~/hpl-2.1/bin/Linux_PII_CBLAS
$ mpirun -np 16 --host node1,node2 ./xhpl
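Before timing real runs, the problem size N in HPL.dat largely determines the reported Gflops. A common rule of thumb (an assumption here, not taken from the HPL documentation above) is to size N so the matrix uses roughly 80% of total memory, rounded down to a multiple of the block size NB. The 64 GB memory figure and NB=192 below are example values; adjust both for your cluster:

```shell
#!/bin/sh
# Sketch: suggest an HPL problem size N that fills ~80% of total memory.
# MEM_BYTES (64 GiB) and NB (192) are assumed example values.
MEM_BYTES=$((64 * 1024 * 1024 * 1024))   # total memory available to the run
NB=192                                    # block size from HPL.dat

# N ~ sqrt(0.80 * memory / 8 bytes per double), rounded down to a multiple of NB
N=$(awk -v m="$MEM_BYTES" -v nb="$NB" \
    'BEGIN { n = int(sqrt(m * 0.80 / 8)); print n - (n % nb) }')
echo "Suggested HPL problem size N = $N"
```

For 64 GiB and NB=192 this prints N = 82752. Leaving some headroom below 100% of memory avoids swapping, which would ruin the benchmark result.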

7. The output…..

.....
.....
.....
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2R4          35     4     4     1               0.00              4.019e-02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0108762 ...... PASSED
================================================================================

Finished    864 tests with the following results:
864 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Using iostat to report system input and output

The iostat command is used to monitor system input/output device loading by observing the time the devices are active in relation to their average transfer rates.

For disk statistics

# iostat -m 2 10 -x /dev/sda1

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.31    0.00    0.50    0.19    0.00   99.00

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1              0.00    30.00  0.00 14.00     0.00     0.17    25.14     0.85   60.71   5.50   7.70

where

“-m” = Display statistics in megabytes per second
“2 10” = sample every 2 seconds, for a total of 10 reports
“-x” = Display extended statistics

AVG-CPU Statistics

  • “%user” = % of CPU utilisation that occurred while executing at the user level (application)
  • “%nice” = % of CPU utilisation that occurred while executing at the user level with nice priority
  • “%system” = % of CPU utilisation that occurred while executing at the system level (kernel)
  • “%iowait” = % of time the CPUs were idle during which the system had an outstanding disk I/O request
  • “%steal” = % of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
  • “%idle” = % of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request

DEVICE Statistics

  • “rrqm/s” =  The number of read requests merged per second  that  were queued to the device.
  • “wrqm/s” = The  number of write requests merged per second that were queued to the device
  • “r/s” =  The number of read  requests  that  were  issued  to  the device per second.
  • “w/s” = The number of write requests that were issued to the device per second.
  • “rMB/s” = The number of megabytes read from the device per  second.
  • “wMB/s” = The number of megabytes written to the device per second.
  • “avgrq-sz” = The average size (in sectors) of the requests  that  were issued to the device.
  • “avgqu-sz” = The average queue length of the requests that were issued to the device.
  • “await” = The average  time  (in  milliseconds)  for  I/O  requests issued to the device to be served.
  • “svctm” = This field will be deprecated
  • “%util” = Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilisation for the device).
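Because iostat output is columnar, individual fields are easy to pull out with awk. The device line in this sketch is the hard-coded sample from above, not live output; in practice you would pipe `iostat -x` in:

```shell
#!/bin/sh
# Sketch: extract await and %util from an iostat -x device line.
# The sample line is hard-coded; normally pipe `iostat -x` into awk instead.
printf '%s\n' \
  'sda1              0.00    30.00  0.00 14.00     0.00     0.17    25.14     0.85   60.71   5.50   7.70' |
awk '{ printf "device=%s await=%sms util=%s%%\n", $1, $10, $12 }'
```

With the column order shown above (Device ... await svctm %util), await is field 10 and %util is field 12, so this prints `device=sda1 await=60.71ms util=7.70%`.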

Other uses of iostat

1. Total Device Utilisation in MB

# iostat -d -m
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              53.72         0.01         0.24     330249    8097912
sda1             53.66         0.01         0.24     307361    7997449
sda2              0.06         0.00         0.00      22046     100253
sda3              0.01         0.00         0.00        841        208
hda               0.00         0.00         0.00       1200          0
  • tps = transfers per second. A transfer is an I/O request to the device
  • MB_read = The total number of megabytes read
  • MB_wrtn = The total number of megabytes written
  • MB_read/s = Indicates the amount of data read from the device expressed in megabytes per second.
  • MB_wrtn/s = Indicates the amount of data written to the device expressed in megabytes per second.
# iostat -n -m
.....
.....
Device:              rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
NFS_Server:/vol/vol1 0.26      0.16      0.00      0.00      0.17      0.16      14.82 2.96   3.49
  • ops/s = Indicates the number of operations that were issued to the mount point per second
  • rops/s = Indicates the number of read operations that were issued to the mount point per second
  • wops/s = Indicates the number of write operations that were issued to the mount point per second

Installing and Configuring Ganglia on CentOS 5.8

For a basic understanding of Ganglia and more general information, do look at Installing and configuring Ganglia on CentOS 5.4

Part I: To install ganglia on CentOS 5.8 on the Cluster Head Node, do the following:

  1. Make sure you have the RPM Repositories installed. For more information, see Useful Repositories for CentOS 5
  2. Make sure the libdbi-0.8.1-2.1.x86_64.rpm package from CentOS 5.9 is installed on CentOS 5.8. Apparently, there are no conflicts or dependency issues. See RPM Resource libdbi.so.0()(64bit)
    # wget ftp://rpmfind.net/linux/centos/5.9/os/x86_64/CentOS/libdbi-0.8.1-2.1.x86_64.rpm
    # rpm -ivh libdbi-0.8.1-2.1.x86_64.rpm
  3. Install PHP 5.4. See Installing PHP 5.4 on CentOS 5
  4. Install the Ganglia Components
    # yum install rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpd
  5. By default, Ganglia passes information over multicast. I prefer to use unicast UDP, as it gives me better control
  6. Assuming 192.168.1.5 is our head node and port number 8649. Edit /etc/ganglia/gmond.conf and start the gmond service.
    cluster {
    name = "My Cluster"
    owner = "kittycool"
    latlong = "unspecified"
    url = "unspecified"
    }
    udp_send_channel {
    host = 192.168.1.5
    port = 8649
    ttl = 1
    }
    udp_recv_channel {
    port = 8649
    }
  7. Configure the service start-up levels and start the gmond service
    chkconfig --levels 235 gmond on
    service gmond start
  8. Configure the /etc/ganglia/gmetad.conf to define the datasource
    data_source "my cluster" 192.168.1.5:8649
  9. Configure the service start-up levels and start the gmetad service
    chkconfig --levels 235 gmetad on
    service gmetad start
  10. Configure the service start-up levels and start the httpd service
    chkconfig --levels 235 httpd on
    service httpd start
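The UDP stanzas from step 6 can be generated with a small script before merging them into /etc/ganglia/gmond.conf. The head-node address 192.168.1.5 and port 8649 follow the example above; adjust them for your cluster. This is only a sketch, writing to /tmp rather than touching the live configuration:

```shell
#!/bin/sh
# Sketch: generate the unicast UDP stanzas for gmond.conf.
# HEAD and PORT mirror the example above; change them for your head node.
HEAD=192.168.1.5
PORT=8649
cat > /tmp/gmond-udp.conf <<EOF
udp_send_channel {
  host = $HEAD
  port = $PORT
  ttl = 1
}
udp_recv_channel {
  port = $PORT
}
EOF
echo "Wrote /tmp/gmond-udp.conf; merge it into /etc/ganglia/gmond.conf"
```

Since every node sends to the same head-node address, the same generated stanza can be pushed to all compute nodes unchanged.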

Part II: To install on the Compute Nodes. I’m assuming the Compute Nodes are on a private network and do not have access to the internet; only the Head Node has internet access. I’m also assuming there is no routing from the compute nodes through the head node for internet access.

  1. Read the following blog  Using yum to download the rpm package that have already been installed.
  2. Copy the rpm to all the compute nodes
  3. Install the package on each compute node
    yum install ganglia-gmond
  4. Configure the service start-up on each compute node
    chkconfig --levels 235 gmond on
  5. Copy the gmond configuration file /etc/ganglia/gmond.conf from the head node to each compute node.

For more information, do look at

  1. Installing and configuring Ganglia on CentOS 5.4

Running OpenMPI on oversubscribed nodes

Taken from OpenMPI FAQ  21. Can I oversubscribe nodes (run more processes than processors)?

Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded.

  • Degraded: When Open MPI thinks that it is in an oversubscribed mode (i.e., more processes are running than there are processors available), MPI processes will automatically run in degraded mode and frequently yield the processor to its peers, thereby allowing all processes to make progress.
  • Aggressive: When Open MPI thinks that it is in an exactly- or under-subscribed mode (i.e., the number of running processes is equal to or less than the number of available processors), MPI processes will automatically run in aggressive mode, meaning that they will never voluntarily give up the processor to other processes. With some network transports, this means that Open MPI will spin in tight loops attempting to make message passing progress, effectively causing other processes to not get any CPU cycles (and therefore never make any progress).

Example of Degraded Mode (running 4 processes on 1 physical core). Open MPI knows that there is only 1 slot, and 4 MPI processes are running on that single slot.

$ cat my-hostfile
localhost slots=1
$ mpirun -np 4 --hostfile my-hostfile a.out

Example of Aggressive Mode (running 4 processes on 4 or more physical cores). Open MPI knows that there are at least 4 slots for the 4 MPI processes.

$ cat my-hostfile
localhost slots=4
$ mpirun -np 4 --hostfile my-hostfile a.out
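To avoid accidentally tipping into degraded mode, one approach (a sketch of mine, not from the FAQ) is to set the slot count from what the OS reports rather than hard-coding it. Note that `nproc` counts logical processors, which may exceed physical cores on Hyper-Threaded machines:

```shell
#!/bin/sh
# Sketch: build a hostfile whose slot count matches the processors the OS
# reports, so mpirun with a matching -np stays in aggressive mode.
CORES=$(nproc)
echo "localhost slots=$CORES" > my-hostfile
cat my-hostfile
# Then launch with a matching process count, e.g.:
# mpirun -np "$CORES" --hostfile my-hostfile ./a.out
```

Any `-np` larger than the generated slot count would make Open MPI treat the node as oversubscribed and switch to degraded mode.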

Handling input files on PBS

Single Input File (Serial run)

If you need to run your job on a single set of input files, you have to add the lines below to your PBS script. Suppose the single input file is data1.ini. The program will take the input data1.ini and generate an output file data1.out. In your submission script:

#!/bin/bash
.....
.....
## Use data1.ini as the input file for $file
cd $PBS_O_WORKDIR
mybinaryprogram < data1.ini 1> data1.out 2> data1.err
.....
.....

Multiple Input Files (Serial run)

If you wish to use multiple input files, you should use the PBS job array feature. This can be expressed with the -t parameter, which allows many copies of the same script to be queued all at once. You can use the PBS_ARRAYID environment variable to differentiate between the different jobs in the array.

Assuming the data file

data1.ini
data2.ini
data3.ini
data4.ini
data5.ini
data6.ini
data7.ini
data8.ini
#!/bin/bash
.....
.....
#PBS -t 1-8
cd $PBS_O_WORKDIR
mybinaryprogram < data${PBS_ARRAYID}.ini 1> data${PBS_ARRAYID}.out 2> data${PBS_ARRAYID}.err
.....
.....
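Outside of PBS, the index-to-filename mapping can be sanity-checked by looping over the array indices by hand. This sketch only echoes the names; under PBS, the scheduler sets PBS_ARRAYID itself, one value per array job:

```shell
#!/bin/sh
# Sketch: emulate what the job array #PBS -t 1-8 does for each index.
# PBS normally sets PBS_ARRAYID; here we loop over the values by hand.
for PBS_ARRAYID in 1 2 3 4 5 6 7 8; do
    echo "array job $PBS_ARRAYID reads data${PBS_ARRAYID}.ini, writes data${PBS_ARRAYID}.out"
done
```

Each of the eight array jobs thus works on its own input file and writes its own output, with no coordination needed between them.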

The above information is obtained from

  1. Michigan State University HPCC “Advanced Scripting Using PBS Environment Variables”

Running multiple copies of the same job at the same time on PBS

In some situations, you may have to run the same job several times. This is true, for example, of simulations driven by random number generators.

Single Copy

#!/bin/bash
.....
.....
cd $PBS_O_WORKDIR
mybinaryprogram 1> mybinaryprogram.out 2> mybinaryprogram.err
.....
.....

Multiple Copies

If we qsub the job more than once, the output will overwrite the results from the previous job. You can use the PBS environment variable PBS_JOBID to create a directory and redirect your output to it

#!/bin/bash
.....
.....
cd $PBS_O_WORKDIR
mkdir $PBS_JOBID
mybinaryprogram 1> $PBS_JOBID/mybinaryprogram.out 2> $PBS_JOBID/mybinaryprogram.err
.....
.....

This should prevent the output overwriting itself.
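The per-job directory trick can be tried outside PBS with a made-up job id. Under PBS, the scheduler sets $PBS_JOBID for you; the value 1234.headnode below is invented purely for illustration:

```shell
#!/bin/sh
# Sketch: emulate the per-job output directory with an invented job id.
# Under PBS, $PBS_JOBID is set by the scheduler (e.g. 1234.headnode).
PBS_JOBID=1234.headnode
mkdir -p "$PBS_JOBID"
echo "run output" > "$PBS_JOBID/mybinaryprogram.out"
ls "$PBS_JOBID"
```

Because every submission gets a unique job id, each run lands in its own directory and nothing is overwritten.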

The above information is obtained from

  1. Michigan State University HPCC “Advanced Scripting Using PBS Environment Variables”

General run-time tuning for Open MPI 1.4 and later (Part 1)

Taken from 17. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)

When invoking mpirun, you can pass the following Open MPI 1.4 (and above) parameters to improve performance:

  1. --bind-to-none: Do not bind processes (default)
  2. --bind-to-core: Bind each MPI process to a core
  3. --bind-to-socket: Bind each MPI process to a processor socket
  4. --report-bindings: Report how the launched processes are bound by Open MPI

If the hardware has multiple hardware threads (e.g. Hyper-Threading), only the first thread of each core is used with --bind-to-*. According to the article, this is supposed to be fixed in v1.5.

The following options are to be used with --bind-to-*:

  1. --byslot: Alias for --bycore
  2. --bycore: When laying out processes, put sequential MPI processes on adjacent processor cores. (Default)
  3. --bysocket: When laying out processes, put sequential MPI processes on adjacent processor sockets.
  4. --bynode: When laying out processes, put sequential MPI processes on adjacent nodes.

Finally, you can use --cpus-per-proc, which binds ncpus OS processor IDs to each MPI process. Suppose there is a machine with 4 sockets of 4 cores each, hence 16 cores in total.

$ mpirun -np 8 --cpus-per-proc 2 my_mpi_process

The command will bind each MPI process to ncpus=2 cores. All cores on the machine will be used.
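The binding that --cpus-per-proc 2 produces on such a machine can be worked out arithmetically. The sketch below only prints the expected core ranges and does not query Open MPI; use --report-bindings to see the layout the runtime actually chooses:

```shell
#!/bin/sh
# Sketch: expected core assignment for -np 8 --cpus-per-proc 2 on 16 cores.
# Purely arithmetic; --report-bindings shows what Open MPI really does.
NP=8
NCPUS=2
rank=0
while [ "$rank" -lt "$NP" ]; do
    first=$((rank * NCPUS))
    last=$((first + NCPUS - 1))
    echo "rank $rank -> cores $first-$last"
    rank=$((rank + 1))
done
```

With 8 ranks of 2 cores each, the last line is `rank 7 -> cores 14-15`, confirming all 16 cores are used.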

Diagnostic Tools to diagnose Infiniband Fabric Information

There are a few diagnostic tools to examine InfiniBand fabric information. Use man to look up the parameters of each tool.

  1. ibnodes – (Show Infiniband nodes in topology)
  2. ibhosts – (Show InfiniBand host nodes in topology)
  3. ibswitches – (Show InfiniBand switch nodes in topology)
  4. ibnetdiscover – (Discover InfiniBand topology)
  5. ibchecknet – (Validate IB subnet and report errors)
  6. ibdiagnet (Scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices)
  7. perfquery (finds errors on one or more HCAs and switch ports)

ibnodes (Show Infiniband nodes in topology)

ibnodes is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the IB nodes (CAs and switches).

# ibnodes
.....
Ca      : 0x0000000000009b02 ports 2 "c00 HCA-1"
Ca      : 0x0000000000005af0 ports 1 "h00 HCA-1"
Switch  : 0x00000000000000fa ports 36 "IBM HSSM" enhanced port 0 lid 19 lmc 0
.....

ibhosts  (Show InfiniBand host nodes in topology)

ibhosts is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the CA nodes.

# ibhosts
Ca      : 0x0000000000009b02 ports 2 "c00 HCA-1"
Ca      : 0x0000000000005af0 ports 1 "h00 HCA-1"

ibswitches (Show InfiniBand switch nodes in topology)

ibswitches is a script which either walks the IB subnet topology or uses an already saved topology file and extracts the switch nodes.

# ibswitches
Switch  : 0x00000000000003fa ports 36 "IBM HSSM" enhanced port 0 lid 19 lmc 0
Switch  : 0x00000000000003cc ports 36 "IBM HSSM" enhanced port 0 lid 16 lmc 0

ibnetdiscover (Discover InfiniBand topology)

ibnetdiscover performs IB subnet discovery and outputs a human-readable topology file. GUIDs, node types, and port numbers are displayed, as well as port LIDs and NodeDescriptions. All nodes (and links) are displayed (full topology). Optionally, this utility can be used to list the currently connected nodes by node type. The output is printed to standard output unless a topology file is specified.

# ibnetdiscover
#
# Topology file: generated on Mon Jan 28 14:19:57 2013
#
# Initiated from node 0000000000000080 port 0000090300451281

vendid=0x2c9
devid=0xc738
sysimgguid=0x2c90000000000
switchguid=0x2c90000000080(0000000000080)
Switch  36 "S-0002c9030071ba80"         # "MF0;switch-6260a0:SX90Y3245/U1" enhanced port 0 lid 2 lmc 0
[2]     "H-00000000000011e0"[1](00000000000e1)          # "node-c01 HCA-1" lid 3 4xQDR
[3]     "H-00000000000012d0"[1](00000000000d1)          # "node-c02 HCA-1" lid 4 4xQDR
....
....

ibchecknet (Validate IB subnet and report errors)

# ibchecknet
......
......
## Summary: 31 nodes checked, 0 bad nodes found
##          88 ports checked, 59 bad ports found
##          12 ports have errors beyond threshold

perfquery command

The perfquery command is useful for finding errors on one or more HCAs and switch ports. You can also use perfquery to reset HCA and switch port counters.

# perfquery
# Port counters: Lid 1 port 1
PortSelect:......................1
CounterSelect:...................0x1400
SymbolErrorCounter:..............0
LinkErrorRecoveryCounter:........0
LinkDownedCounter:...............0
PortRcvErrors:...................13
PortRcvRemotePhysicalErrors:.....0
PortRcvSwitchRelayErrors:........0
PortXmitDiscards:................0
PortXmitConstraintErrors:........0
PortRcvConstraintErrors:.........0
CounterSelect2:..................0x00
LocalLinkIntegrityErrors:........0
ExcessiveBufferOverrunErrors:....0
VL15Dropped:.....................0
PortXmitData:....................199578830
PortRcvData:.....................504398997
PortXmitPkts:....................15649860
PortRcvPkts:.....................15645526
PortXmitWait:....................0

References:

  1. Appendix B. InfiniBand Fabric Troubleshooting

Diagnostic Tools to diagnose Infiniband Device

There are a few Diagnostic Tools to diagnose Infiniband Devices.

  1. ibv_devinfo (Query RDMA devices)
  2. ibstat (Query basic status of InfiniBand device(s))
  3. ibstatus (Query basic status of InfiniBand device(s))

ibv_devinfo (Query RDMA devices)

Print information about RDMA devices available for use from userspace.

# ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.10.2322
        node_guid:                      0002:c903:0045:1280
        sys_image_guid:                 0002:c903:0045:1283
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       IBM0FD0140019
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             IB

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             IB

ibstat (Query basic status of InfiniBand device(s))

ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.

It is similar to the ibstatus utility but implemented as a binary rather than a script. It has options to list CAs and/or ports and displays more information than ibstatus.

# ibstat
CA 'mlx4_0'
        CA type: MT4099
        Number of ports: 2
        Firmware version: 2.10.2322
        Hardware version: 0
        Node GUID: 0x0002c90300451280
        System image GUID: 0x0002c90300451283
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x0251486a
                Port GUID: 0x0002c90300451281
                Link layer: InfiniBand
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02514868
                Port GUID: 0x0002c90300451282
                Link layer: InfiniBand

ibstatus (Query basic status of InfiniBand device(s))

ibstatus is a script which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.

# ibstatus
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0045:1281
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0002:c903:0045:1282
        base lid:        0x0
        sm lid:          0x0
        state:           1: DOWN
        phys state:      2: Polling
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Sample PBS Scripts for R

Here is a sample PBS script that can be used for R. This is just a suggested PBS script; modify and comment at will. The script below is named R.sh.

#!/bin/bash
#PBS -N R-job
#PBS -j oe
#PBS -V
#PBS -m bea
#PBS -M myemail@hotmail.com
#PBS -l nodes=1:ppn=8

# comment these out if you wish
echo "qsub host = " $PBS_O_HOST
echo "original queue = " $PBS_O_QUEUE
echo "qsub working directory absolute = " $PBS_O_WORKDIR
echo "pbs environment = " $PBS_ENVIRONMENT
echo "pbs batch = " $PBS_JOBID
echo "pbs job name from me = " $PBS_JOBNAME
echo "Name of file containing nodes = " $PBS_NODEFILE
echo "contents of nodefile = " `cat $PBS_NODEFILE`
echo "Name of queue to which job went = " $PBS_QUEUE

# Pre-processing script
cd $PBS_O_WORKDIR
NCPUS=`cat $PBS_NODEFILE | wc -l`
echo "Number of requested processors = " $NCPUS

# Load R Module
module load mpi/intel_1.4.3
module load intel/12.0.2
module load R/R-2.15.1

# ###############
# Execute Program
# ################
/usr/local/R-2.15.1/bin/R CMD BATCH $file

The corresponding qsub command and its parameters should be something like

$ qsub -q dqueue -l nodes=1:ppn=8 -v file=Rjob.r R.sh
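The NCPUS line in R.sh simply counts lines in the nodefile; with nodes=1:ppn=8, PBS writes the host name into the nodefile eight times. This can be simulated outside PBS (the nodefile path and host name below are invented for the sketch):

```shell
#!/bin/sh
# Sketch: simulate the PBS_NODEFILE for nodes=1:ppn=8 and count its entries,
# mirroring the NCPUS line in R.sh. The path and host name are invented.
PBS_NODEFILE=/tmp/fake_nodefile
for i in 1 2 3 4 5 6 7 8; do echo "node1"; done > "$PBS_NODEFILE"
NCPUS=`cat $PBS_NODEFILE | wc -l`
echo "Number of requested processors = $NCPUS"
```

This is why the same counting idiom works unchanged for any nodes/ppn combination: the nodefile always holds one line per allocated processor.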