Running Linpack (HPL) Test on Linux Cluster with OpenMPI and Intel Compilers

According to the HPL website,

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

The algorithm used by HPL can be summarized by the following keywords: Two-dimensional block-cyclic data distribution – Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths – Recursive panel factorization with pivot search and column broadcast combined – Various virtual panel broadcast topologies – bandwidth reducing swap-broadcast algorithm – backward substitution with look-ahead of depth 1.

1. Requirements:

  1. MPI (1.1 compliant). For this entry, I’m using OpenMPI
  2. BLAS or VSIPL

2. For installing BLAS, LAPACK and OpenMPI, do look at:

  1. Building BLAS Library using Intel and GNU Compiler
  2. Building LAPACK 3.4 with Intel and GNU Compiler
  3. Building OpenMPI with Intel Compilers
  4. Compiling ATLAS on CentOS 5

3. Download the latest HPL (hpl-2.1.tar.gz) from http://www.netlib.org
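
For example, a minimal sketch of downloading and unpacking the tarball (assuming the usual Netlib location http://www.netlib.org/benchmark/hpl/; adjust if the file has moved):

# wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
# tar -xzf hpl-2.1.tar.gz -C ~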

4. Copy the Make.Linux_PII_CBLAS file from $(HOME)/hpl-2.1/setup/ to $(HOME)/hpl-2.1/
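
For example:

# cp ~/hpl-2.1/setup/Make.Linux_PII_CBLAS ~/hpl-2.1/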

5. Edit Make.Linux_PII_CBLAS file

# vim ~/hpl-2.1/Make.Linux_PII_CBLAS
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_PII_CBLAS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = $(HOME)/hpl-2.1
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a

# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/local/mpi/intel
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /usr/local/atlas/lib
LAinc        =
LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
#
.....
.....
.....
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = /usr/local/mpi/intel/bin/mpicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = /usr/local/mpi/intel/bin/mpicc
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------

6. Compile the HPL

# make arch=Linux_PII_CBLAS
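
If the build succeeds, the xhpl binary and a default HPL.dat input file should end up under $(TOPdir)/bin/$(ARCH); a quick check:

# ls ~/hpl-2.1/bin/Linux_PII_CBLAS/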

Running the LinPack on multiple Nodes

$ cd ~/hpl-2.1/bin/Linux_PII_CBLAS
$ mpirun -np 16 --host node1,node2 ./xhpl

7. The output…..

.....
.....
.....
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2R4          35     4     4     1               0.00              4.019e-02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0108762 ...... PASSED
================================================================================

Finished    864 tests with the following results:
864 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Using iostat to report system input and output

The iostat command is used to monitor system input/output device loading by observing the time the devices are active in relation to their average transfer rates.

For disk statistics

# iostat -m 2 10 -x /dev/sda1

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.31    0.00    0.50    0.19    0.00   99.00

Device:         rrqm/s   wrqm/s   r/s   w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda1              0.00    30.00  0.00 14.00     0.00     0.17    25.14     0.85   60.71   5.50   7.70

where

“-m” = Display statistics in megabytes per second
“2 10” = Sample every 2 seconds, for a total of 10 reports
“-x” = Display extended statistics

AVG-CPU Statistics

  • “%user” = % of CPU utilisation that occurred while executing at the user level (application)
  • “%nice” = % of CPU utilisation that occurred while executing at the user level with nice priority
  • “%system” = % of CPU utilisation that occurred while executing at the system level (kernel)
  • “%iowait” = % of time the CPU(s) were idle during which the system had an outstanding disk I/O request
  • “%steal” = % of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
  • “%idle” = % of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request

DEVICE Statistics

  • “rrqm/s” = The number of read requests merged per second that were queued to the device.
  • “wrqm/s” = The number of write requests merged per second that were queued to the device.
  • “r/s” = The number of read requests that were issued to the device per second.
  • “w/s” = The number of write requests that were issued to the device per second.
  • “rMB/s” = The number of megabytes read from the device per second.
  • “wMB/s” = The number of megabytes written to the device per second.
  • “avgrq-sz” = The average size (in sectors) of the requests that were issued to the device.
  • “avgqu-sz” = The average queue length of the requests that were issued to the device.
  • “await” = The average time (in milliseconds) for I/O requests issued to the device to be served.
  • “svctm” = The average service time (in milliseconds) for I/O requests issued to the device. This field is deprecated.
  • “%util” = Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilisation for the device).

Other usage of IOSTAT

1. Total Device Utilisation in MB

# iostat -d -m
Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda              53.72         0.01         0.24     330249    8097912
sda1             53.66         0.01         0.24     307361    7997449
sda2              0.06         0.00         0.00      22046     100253
sda3              0.01         0.00         0.00        841        208
hda               0.00         0.00         0.00       1200          0
  • tps = Transfers per second. A transfer is an I/O request to the device.
  • MB_read = The total number of megabytes read.
  • MB_wrtn = The total number of megabytes written.
  • MB_read/s = The amount of data read from the device expressed in megabytes per second.
  • MB_wrtn/s = The amount of data written to the device expressed in megabytes per second.

2. NFS Statistics

# iostat -n -m
.....
.....
Device:              rMB_nor/s wMB_nor/s rMB_dir/s wMB_dir/s rMB_svr/s wMB_svr/s ops/s rops/s wops/s
NFS_Server:/vol/vol1 0.26      0.16      0.00      0.00      0.17      0.16      14.82 2.96   3.49
  • ops/s = The number of operations that were issued to the mount point per second
  • rops/s = The number of read operations that were issued to the mount point per second
  • wops/s = The number of write operations that were issued to the mount point per second
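
As with the disk statistics earlier, the NFS report can be combined with an interval and a count, e.g. sampling every 2 seconds for a total of 10 reports:

# iostat -n -m 2 10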

Installing and Configuring Ganglia on CentOS 5.8

For a basic understanding of Ganglia and more general information, do look at Installing and Configuring Ganglia on CentOS 5.4.

Part I: To install Ganglia on CentOS 5.8 on the Cluster Head Node, do the following:

  1. Make sure you have the RPM Repositories installed. For more information, see Useful Repositories for CentOS 5
  2. Make sure the libdbi-0.8.1-2.1.x86_64.rpm package from CentOS 5.9 is installed on the CentOS 5.8 system. Apparently, there are no conflicts or dependency issues. See RPM Resource libdbi.so.0()(64bit)
    # wget ftp://rpmfind.net/linux/centos/5.9/os/x86_64/CentOS/libdbi-0.8.1-2.1.x86_64.rpm
    # rpm -ivh libdbi-0.8.1-2.1.x86_64.rpm
  3. Install PHP 5.4. See Installing PHP 5.4 on CentOS 5
  4. Install the Ganglia Components
    # yum install rrdtool ganglia ganglia-gmetad ganglia-gmond ganglia-web httpd
  5. By default, Ganglia uses multicast UDP to pass information. I prefer to use unicast UDP as it gives me better control
  6. Assuming 192.168.1.5 is our head node and the port number is 8649, edit /etc/ganglia/gmond.conf as follows and start the gmond service.
    cluster {
    name = "My Cluster"
    owner = "kittycool"
    latlong = "unspecified"
    url = "unspecified"
    }
    udp_send_channel {
    host = 192.168.1.5
    port = 8649
    ttl = 1
    }
    udp_recv_channel {
    port = 8649
    }
  7. Configure the service start-up run levels and start the gmond service (see the sanity checks after this list)
    chkconfig --levels 235 gmond on
    service gmond start
  8. Edit /etc/ganglia/gmetad.conf to define the data source
    data_source "my cluster" 192.168.1.5:8649
  9. Configure the service start-up run levels and start the gmetad service
    chkconfig --levels 235 gmetad on
    service gmetad start
  10. Configure the service start-up run levels and start the httpd service
    chkconfig --levels 235 httpd on
    service httpd start
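
Once Part I is done, two quick sanity checks can be run on the head node. The first assumes the default tcp_accept_channel on port 8649 was left in gmond.conf; gmond should dump the cluster state as XML and close the connection. The second assumes ganglia-web is published under the default /ganglia alias and should return HTTP 200.

# telnet localhost 8649
# curl -I http://localhost/ganglia/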

Part II: To install on the Compute Nodes, I’m assuming the compute nodes are on a private network and do not have access to the internet. Only the head node has internet access. I’m also assuming there is no routing from the compute nodes via the head node for internet access.

  1. Read the following blog  Using yum to download the rpm package that have already been installed.
  2. Copy the rpm to all the compute nodes
  3. Install the package on each compute node
    yum install ganglia-gmond
  4. Configure the service startup for each compute node
    chkconfig --levels 235 gmond on
  5. Copy the gmond configuration file /etc/ganglia/gmond.conf from the head node to the compute nodes. A sketch of the whole compute-node workflow (steps 2 to 5) is shown below.
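
A rough sketch, run from the head node. The node names node1 and node2 are hypothetical, passwordless SSH is assumed, the ganglia-gmond rpm is assumed to have already been downloaded to /tmp/gmond-rpms/ using the method referenced in step 1, and its dependencies are assumed to be already present on the compute nodes (rpm -ivh is used instead of yum since the nodes have no repository access):

for n in node1 node2; do
    # copy the downloaded rpm(s) to the compute node and install locally
    scp /tmp/gmond-rpms/*.rpm $n:/tmp/
    ssh $n "rpm -ivh /tmp/ganglia-gmond-*.rpm"
    # reuse the head node's gmond configuration (step 5)
    scp /etc/ganglia/gmond.conf $n:/etc/ganglia/gmond.conf
    # enable gmond at the usual run levels and start it (step 4)
    ssh $n "chkconfig --levels 235 gmond on && service gmond start"
done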

For more information, do look at

  1. Installing and configuring Ganglia on CentOS 5.4

Running OpenMPI on oversubscribed nodes

Taken from the Open MPI FAQ: 21. Can I oversubscribe nodes (run more processes than processors)?

Open MPI basically runs its message passing progression engine in two modes: aggressive and degraded.

  • Degraded: When Open MPI thinks that it is in an oversubscribed mode (i.e., more processes are running than there are processors available), MPI processes will automatically run in degraded mode and frequently yield the processor to its peers, thereby allowing all processes to make progress.
  • Aggressive: When Open MPI thinks that it is in an exactly- or under-subscribed mode (i.e., the number of running processes is equal to or less than the number of available processors), MPI processes will automatically run in aggressive mode, meaning that they will never voluntarily give up the processor to other processes. With some network transports, this means that Open MPI will spin in tight loops attempting to make message passing progress, effectively causing other processes to not get any CPU cycles (and therefore never make any progress).

Example of Degraded Mode (running 4 MPI processes on 1 slot / 1 physical core). Open MPI knows that there is only 1 slot and that 4 MPI processes are running on it.

$ cat my-hostfile
localhost slots=1
$ mpirun -np 4 --hostfile my-hostfile a.out

Example of Aggressive Mode (running 4 MPI processes on 4 or more physical cores). Open MPI knows that there are at least 4 slots for the 4 MPI processes.

$ cat my-hostfile
localhost slots=4
$ mpirun -np 4 --hostfile my-hostfile a.out
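
If you know a job will be oversubscribed but Open MPI cannot detect it (for example, when the hostfile advertises more slots than physical cores actually exist), degraded mode can also be forced by hand via the mpi_yield_when_idle MCA parameter; a sketch, so do check the behaviour for your Open MPI version:

$ mpirun -np 4 --hostfile my-hostfile --mca mpi_yield_when_idle 1 a.out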

Handling input files on PBS

Single Input File (Serial run)

If you need to run your job with a single input file, you have to add the lines below to your PBS submission script. Suppose the single input file is data1.ini; the program will take data1.ini as input and generate an output file data1.out. In your submission script:

#!/bin/bash
.....
.....
## Use data1.ini as the input file
cd $PBS_O_WORKDIR
mybinaryprogram < data1.ini 1> data1.out 2> data1.err
.....
.....

Multiple Input Files (Serial run)

If you wish to use multiple input files, you should use PBS job arrays. This can be expressed with the -t parameter. The -t option allows many copies of the same script to be queued all at once. You can use the PBS_ARRAYID environment variable to differentiate between the different jobs in the array.

Assuming the data files are:

data1.ini
data2.ini
data3.ini
data4.ini
data5.ini
data6.ini
data7.ini
data8.ini

#!/bin/bash
.....
.....
#PBS -t 1-8
cd $PBS_O_WORKDIR
mybinaryprogram < data${PBS_ARRAYID}.ini 1> data${PBS_ARRAYID}.out 2> data${PBS_ARRAYID}.err
.....
.....
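
Submitting and monitoring the array works like an ordinary job; a sketch, where the script name myarrayjob.pbs is hypothetical and qstat -t (Torque-style PBS) expands the individual array members:

$ qsub myarrayjob.pbs
$ qstat -t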

The above information is obtained from:

  1. Michigan State University HPCC “Advanced Scripting Using PBS Environment Variables”

Running multiple copies of the same job at the same time on PBS

In some situations, you may have to run the same job several times. This is true, for example, for runs based on random number generators.

Single Copy

#!/bin/bash
.....
.....
## Run the program once, redirecting stdout and stderr
cd $PBS_O_WORKDIR
mybinaryprogram 1> mybinaryprogram.out 2> mybinaryprogram.err
.....
.....

Multiple Copies

If we qsub the job more than once, the output will overwrite the results from the previous jobs. You can use the PBS environment variable PBS_JOBID to create a directory and redirect your output to the respective directory.

#!/bin/bash
.....
.....
cd $PBS_O_WORKDIR
mkdir $PBS_JOBID
mybinaryprogram 1> $PBS_JOBID/mybinaryprogram.out 2> $PBS_JOBID/mybinaryprogram.err
.....
.....

This should prevent the output from overwriting itself.
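
With the per-job directories in place, a minimal sketch of submitting the same script several times (the script name run.pbs is hypothetical):

$ for i in 1 2 3; do qsub run.pbs; done

Each submission gets its own $PBS_JOBID and therefore its own output directory.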

The above information is obtained from:

  1. Michigan State University HPCC “Advanced Scripting Using PBS Environment Variables”

General run-time tuning for Open MPI 1.4 and later (Part 1)

Taken from the Open MPI FAQ: 17. How do I tell Open MPI to use processor and/or memory affinity in Open MPI v1.4.x? (How do I use the --by* and --bind-to-* options?)

When invoking mpirun with Open MPI 1.4 and above, you can pass the following parameters to improve performance:

  1. --bind-to-none: Do not bind processes (default)
  2. --bind-to-core: Bind each MPI process to a core
  3. --bind-to-socket: Bind each MPI process to a processor socket
  4. --report-bindings: Report how the launched processes are bound by Open MPI

If the hardware has multiple hardware threads (e.g. Hyper-Threading), only the first thread of each core is used with --bind-to-*. According to the FAQ, this is supposed to be fixed in v1.5.

The following options are to be used together with --bind-to-* (see the example after this list):

  1. --byslot: Alias for --bycore
  2. --bycore: When laying out processes, put sequential MPI processes on adjacent processor cores. (Default)
  3. --bysocket: When laying out processes, put sequential MPI processes on adjacent processor sockets.
  4. --bynode: When laying out processes, put sequential MPI processes on adjacent nodes.
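
For example, a sketch combining the binding and mapping options (the executable name my_mpi_program is hypothetical):

$ mpirun -np 8 --bind-to-core --bysocket --report-bindings my_mpi_program

This binds each of the 8 processes to one core, lays consecutive ranks out across sockets, and prints the resulting bindings at start-up.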

Finally, you can use --cpus-per-proc <ncpus>, which binds ncpus OS processor IDs to each MPI process. Suppose there is a machine with 4 processor sockets of 4 cores each, hence 16 cores in total.

$ mpirun -np 8 --cpus-per-proc 2 my_mpi_process

The command will bind each of the 8 MPI processes to ncpus=2 cores, so all 16 cores on the machine will be used.