Using Intel IMB-MPI1 to check fabrics and expected performance

In your .bashrc, do source the following:

source /usr/local/intel_2015/parallel_studio_xe_2015/bin/psxevars.sh intel64
source /usr/local/intel_2015/impi/5.0.3.049/bin64/mpivars.sh intel64
source /usr/local/intel_2015/composerxe/bin/compilervars.sh intel64
source /usr/local/intel_2015/mkl/bin/mklvars.sh intel64
MKLROOT=/usr/local/intel_2015/mkl
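
After opening a new shell, a quick check that the Intel environment has been picked up:

$ which mpicc mpirun
$ echo $MKLROOT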

To run the 3 workloads pingpong, sendrecv, and exchange with IMB-MPI1:

$ mpirun -r ssh -RDMA -n 512 -env I_MPI_DEBUG 5 IMB-MPI1 pingpong sendrecv exchange
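
To restrict the run to particular nodes, Intel MPI's Hydra launcher can also take a host file; a sketch, assuming a file ./hosts that lists one node name per line:

$ mpirun -f ./hosts -n 512 -env I_MPI_DEBUG 5 IMB-MPI1 pingpong sendrecv exchange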


Running Linpack (HPL) Test on Linux Cluster with OpenMPI and Intel Compilers

According to the HPL website:

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

The algorithm used by HPL can be summarized by the following keywords: Two-dimensional block-cyclic data distribution – Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths – Recursive panel factorization with pivot search and column broadcast combined – Various virtual panel broadcast topologies – bandwidth reducing swap-broadcast algorithm – backward substitution with look-ahead of depth 1.

1. Requirements:

  1. MPI (1.1 compliant). For this entry, I’m using OpenMPI
  2. BLAS or VSIPL. For this entry, I’m using ATLAS to provide the BLAS

2. For installing BLAS, LAPACK and OpenMPI, do look at

  1. Building BLAS Library using Intel and GNU Compiler
  2. Building LAPACK 3.4 with Intel and GNU Compiler
  3. Building OpenMPI with Intel Compilers
  4. Compiling ATLAS on CentOS 5

3. Download the latest HPL (hpl-2.1.tar.gz) from http://www.netlib.org

4. Copy the Make.Linux_PII_CBLAS file from $(HOME)/hpl-2.1/setup/ to $(HOME)/hpl-2.1/
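
For example, assuming HPL was unpacked in your home directory:

$ cp ~/hpl-2.1/setup/Make.Linux_PII_CBLAS ~/hpl-2.1/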

5. Edit Make.Linux_PII_CBLAS file

# vim ~/hpl-2.1/Make.Linux_PII_CBLAS
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_PII_CBLAS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = $(HOME)/hpl-2.1
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a

# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/local/mpi/intel
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /usr/local/atlas/lib
LAinc        =
LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
#
.....
.....
.....
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = /usr/local/mpi/intel/bin/mpicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = /usr/local/mpi/intel/bin/mpicc
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------

6. Compile the HPL

# make arch=Linux_PII_CBLAS
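
A successful build places the xhpl binary and a default HPL.dat input file under bin/$(ARCH):

# ls ~/hpl-2.1/bin/Linux_PII_CBLAS
HPL.dat  xhpl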

Running Linpack on multiple nodes
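
The run is controlled by the HPL.dat file that sits next to xhpl; the problem size (N), block size (NB) and process grid (P x Q) should be tuned for your cluster, with P x Q equal to the number of MPI ranks. An illustrative excerpt of the first lines (the values 10000, 128, 4 and 4 are placeholders, not tuned settings):

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
10000        Ns
1            # of NBs
128          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
4            Qs
16.0         threshold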

$ cd ~/hpl-2.1/bin/Linux_PII_CBLAS
$ mpirun -np 16 --host node1,node2 ./xhpl

7. The output…..

.....
.....
.....
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2R4          35     4     4     1               0.00              4.019e-02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0108762 ...... PASSED
================================================================================

Finished    864 tests with the following results:
864 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Using iperf to measure the bandwidth and quality of a network

This writeup is adapted from the iPerf Tutorial by OpenManiak; for a more detailed and in-depth writeup, do read the full iPerf Tutorial. According to the iperf project site:

Iperf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. Iperf allows the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, datagram loss.

Iperf can generate TCP and UDP traffic to perform the following kinds of tests:

  • Latency (response time or RTT): can be measured with the Ping utility.
  • Jitter: can be measured with an Iperf UDP test.
  • Datagram loss: can again, be measured with an Iperf UDP test.
  • Bandwidth: can be measured with the Iperf TCP tests.

Iperf uses the unique characteristics of TCP and UDP to provide statistics about network links. (TCP checks that packets are correctly sent to the receiver; UDP is sent without any checks.)

Iperf can be easily installed on a Linux box. After downloading the package, you can do the following:

# tar -zxvf iperf-2.0.5.tar.gz
# cd iperf-2.0.5
# ./configure
# make
# make install
# cd src
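
A quick check that the binary is installed and on the PATH:

# which iperf
# iperf -v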

Iperf follows a client-server model. The server and the client can each be Linux or Windows. Since this blog is about Linux, both our server and client will be Linux.

Do note that the iperf client connects to the iperf server through port 5001, and the bandwidth is measured from the client to the server.
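
Both ends default to TCP port 5001. If that port is blocked, the same alternative port must be given on both sides with -p (5002 below is just an arbitrary example):

# iperf -s -p 5002
# iperf -c 192.168.50.1 -p 5002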

1. Single uni-directional test with data formatting (-f argument)

On the Client, we can use the following format

  1. The -f argument displays the results in the desired format.
  2. The supported format characters are: bits (b), bytes (B), kilobits (k), kilobytes (K), megabits (m), megabytes (M), gigabits (g) or gigabytes (G).

# iperf -c 192.168.50.1 -f G

On the Server, we just use

# iperf -s

2. Bi-directional bandwidth measurement (-r argument)

By default, only the connection from the client to the server is measured. With the “-r” argument, the iperf server connects back to the client, so both directions are measured, one after the other.

On the Client Side

# iperf -c 192.168.50.1 -r -f G

On the Server Side

# iperf -s

3. Simultaneous bi-directional bandwidth measurement (-d argument)

With the -d argument, both directions are measured at the same time instead of one after the other.

On the Client Side

# iperf -c 192.168.50.1 -d -f G

On the Server Side

# iperf -s

4. Duration and interval settings (-t test duration in seconds, -i reporting interval in seconds)

On the Client Side

# iperf -c 192.168.50.1 -t 20 -i 1

On the Server Side

# iperf -s

5. UDP Settings (-u) and Bandwidth Settings (-b)

The UDP tests with the -u argument give valuable information about jitter and packet loss. If there is no -u argument, iperf defaults to TCP.

On the Client Side

# iperf -c 192.168.50.1 -u -b 10m

On the Server Side (-i sets the reporting interval)

# iperf -s -u -i 2

6. Parallel tests (-P argument, number of parallel streams):

On Client side

# iperf -c 192.168.50.1 -P 4

On the Server Side

# iperf -s

Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part II)

This is a continuation of the article Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part I)

B. Running IMB

After “make”, the IMB-MPI1 executable will have been built. Run IMB-MPI1 pingpong from the management node or head node, and ensure IMB-MPI1 is present in the directory.

# cd /home/hpc/imb/src
# mpirun -np 16 -host node1,node2 /home/hpc/imb/src/IMB-MPI1 pingpong
# mpirun -np 16 -host node1,node2 /home/hpc/imb/src/IMB-MPI1 sendrecv
# mpirun -np 16 -host node1,node2 /home/hpc/imb/src/IMB-MPI1 exchange

Example of output from “pingpong”

benchmarks to run pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.2, MPI-1 part
#---------------------------------------------------
# Date                  : Mon Feb  7 10:42:48 2011
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.18-164.el5
# Version               : #1 SMP Thu Sep 3 03:28:30 EDT 2009
# MPI Version           : 2.1
# MPI Thread Environment: MPI_THREAD_SINGLE

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time

# Calling sequence was:

# /home/shared-rpm/imb/src/IMB-MPI1 pingpong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 46 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
      #bytes #repetitions      t[usec]   Mbytes/sec
           0         1000         8.74         0.00
           1         1000         8.82         0.11
           2         1000         8.83         0.22
           4         1000         8.89         0.43
           8         1000         8.90         0.86
          16         1000         8.99         1.70
          32         1000         9.00         3.39
          64         1000        10.32         5.91
         128         1000        10.52        11.60
         256         1000        11.24        21.72
         512         1000        12.12        40.30
        1024         1000        13.76        70.98
        2048         1000        15.55       125.59
        4096         1000        17.81       219.35
        8192         1000        22.47       347.67
       16384         1000        45.24       345.41
       32768         1000        59.83       522.29
       65536          640        87.68       712.85
      131072          320       154.80       807.47
      262144          160       312.87       799.05
      524288           80       556.20       898.96
     1048576           40      1078.94       926.84
     2097152           20      2151.90       929.41
     4194304           10      4256.70       939.69

# All processes entering MPI_Finalize

If you wish to use Torque to run IMB, do read the IBM article “Setting up an HPC cluster with Red Hat Enterprise Linux”.
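
A minimal Torque submission script might look like the sketch below; the node count, wall time and paths are assumptions to adapt to your cluster:

#!/bin/bash
#PBS -N imb-pingpong
#PBS -l nodes=2:ppn=8
#PBS -l walltime=00:30:00
# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists the processor slots Torque allocated to this job
mpirun -np 16 -machinefile $PBS_NODEFILE /home/hpc/imb/src/IMB-MPI1 pingpong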

Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part I)

This writeup focuses on verifying the performance of InfiniBand interconnects, and applies to RDMA/iWARP interconnects as well. The material is adapted from the IBM portal article “Setting up an HPC cluster with Red Hat Enterprise Linux”.

A. Building Intel MPI Benchmark (“IMB”)

IMB can be run on a single node or on several nodes. Two or more nodes are required to test message passing between nodes.

Step 1: Download the IMB

1. Go to Intel® MPI Benchmarks 3.2.2 and download the software

2. Untar the package to a shared directory used by the nodes

# tar -zxvf IMB_3.2.2.tar.gz -C /home/hpc

3. Change directory to source directory

# cd /home/hpc/imb/src

4. Edit the make_ict makefile to change the assignment of the CC value from mpiicc to mpicc as shown

LIB_PATH    =
LIBS        =
CC          = mpicc
ifeq (,$(shell which ${CC}))
$(error ${CC} is not defined through the PATH environment variable setting. Please try sourcing an Intel(r) Cluster Tools script file such as "mpivars.[c]sh" or "ictvars.[c]sh")
endif
OPTFLAGS    =
CLINKER     = ${CC}
LDFLAGS     =
CPPFLAGS    =

export CC LIB_PATH LIBS OPTFLAGS CLINKER LDFLAGS CPPFLAGS
include Makefile.base

5. Type “make” at /home/hpc/imb/src

# make

You should see an IMB-MPI1 executable. If you cannot find it, do use the “locate” or “find” command to locate the executable.
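
For example:

# find /home/hpc/imb -name IMB-MPI1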

See Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part II) for the 2nd Part of the Article.