The SPEChpc 2021 Benchmark suite

The full writeup can be found in “Real-World HPC Gets the Benchmark It Deserves” at The Next Platform.

While nothing can beat the notoriety of the long-standing LINPACK benchmark, the metric by which supercomputer performance is gauged, there is ample room for a more practical measure. It might not garner the same mainstream headlines as the Top 500 list of the world’s largest systems, but a new benchmark may fill in the gaps between real-world and theoretical peak compute performance.

The reason this new high performance computing (HPC) benchmark can come out of the gate with immediate legitimacy is because it is from the Standard Performance Evaluation Corporation (SPEC) organization, which has been delivering system benchmark suites since the late 1980s. And the reason it is big news today is because the time is right for a more functional, real-world measure, especially one that can adequately address the range of architectures and changes in HPC (from various accelerators to new steps toward mixed precision, for example).

…..
…..
…..

The SPEChpc 2021 suite includes a broad swath of science and engineering codes that are representative (and portable) across much of what we see in HPC.

– A tested set of benchmarks with performance measurement and validation built into the test harness.
– Benchmarks include full and mini applications covering a wide range of scientific domains and Fortran/C/C++ programming languages.
– Comprehensive support for multiple programming models, including MPI, MPI+OpenACC, MPI+OpenMP, and MPI+OpenMP with target offload.
– Support for most major compilers, MPI libraries, and different flavors of Linux operating systems.
– Four suites, Tiny, Small, Medium, and Large, with increasing workload sizes, allow for appropriate evaluation of different-sized HPC systems, ranging from a single node to many thousands of nodes (see the launch sketch below).
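
As a rough illustration only (the config file name and rank count below are assumptions, and the exact flags depend on your installation; the SPEC site documents the full invocation), the suite is driven by the runhpc harness in much the same way SPEC CPU is driven by runcpu:

$ cd /path/to/hpc2021
$ source shrc
$ runhpc --config=mycluster.cfg --ranks=64 tiny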

“Real-World HPC Gets the Benchmark It Deserves” at The Next Platform

For more information, see https://www.spec.org/hpc2021/

The MLPerf Benchmark Is Good for AI

In just about any situation where you are making capital investments in equipment, you are worried about three things: performance, price/performance, and total cost of ownership. Without some sort of benchmark on which to gauge performance, and without some sense of relative pricing, it is impossible to calculate total cost of ownership, and therefore it is impossible to figure out what to invest the budget in.

This is why the MLPerf benchmark suite is so important. MLPerf was created only three and a half years ago by researchers and engineers from Baidu, Google, Harvard University, Stanford University, and the University of California Berkeley and it is now administered by the MLCommons consortium, formed in December 2020. Very quickly, it has become a key suite of tests that hardware and software vendors use to demonstrate the performance of their AI systems and that end user customers depend on to help them make architectural choices for their AI systems.

The Next Platform, “Why the MLPerf Benchmark Is Good for AI, and Good for You”

The MLPerf site can be found at https://mlcommons.org/en/

Benchmarking Tools for Memory Bandwidth

What is bandwidth?

bandwidth is an artificial benchmark primarily for measuring memory bandwidth on x86 and x86_64 based computers. It is useful for identifying weaknesses in a computer’s memory subsystem, in the bus architecture, in the cache architecture, and in the processor itself.

bandwidth also tests some libc functions and, under GNU/Linux, it attempts to test framebuffer memory access speed if the framebuffer device is available.

Prerequisites:

NASM, GNU Compiler Collection (GCC)

Compiling NASM

bandwidth-1.9.4 requires a recent version of NASM.

% tar -xvf nasm-2.15.05.tar.gz
% cd nasm-2.15.05
% ./configure
% make
% make install

You should now have the nasm binary. Make sure you update $PATH to include the directory containing the nasm binary.
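
For example, if NASM was installed with the default prefix of /usr/local (an assumption; adjust to your actual install prefix), you can append its bin directory to $PATH and verify the version:

% export PATH=/usr/local/bin:$PATH
% nasm -v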

Compiling bandwidth-1.9.4

% tar -zxvf bandwidth-1.9.4.tar.gz
% cd bandwidth-1.9.4
% make bandwidth64

You should now have the bandwidth64 binary.

Run the Test

% ./bandwidth64
Sequential read (64-bit), size = 256 B, loops = 1132462080, 55292.9 MB/s
Sequential read (64-bit), size = 384 B, loops = 765632322, 56075.0 MB/s
Sequential read (64-bit), size = 512 B, loops = 573833216, 56028.0 MB/s
Sequential read (64-bit), size = 640 B, loops = 457595948, 55857.6 MB/s
Sequential read (64-bit), size = 768 B, loops = 382990923, 56092.5 MB/s
Sequential read (64-bit), size = 896 B, loops = 326929770, 55865.7 MB/s
Sequential read (64-bit), size = 1024 B, loops = 285671424, 55789.1 MB/s
Sequential read (64-bit), size = 1280 B, loops = 229320072, 55973.6 MB/s
Sequential read (64-bit), size = 2 kB, loops = 143425536, 56016.5 MB/s
Sequential read (64-bit), size = 3 kB, loops = 95550030, 55977.6 MB/s
Sequential read (64-bit), size = 4 kB, loops = 71729152, 56036.7 MB/s
Sequential read (64-bit), size = 6 kB, loops = 47510700, 55667.7 MB/s
Sequential read (64-bit), size = 8 kB, loops = 35856384, 56020.1 MB/s
Sequential read (64-bit), size = 12 kB, loops = 23738967, 55631.2 MB/s
Sequential read (64-bit), size = 16 kB, loops = 17666048, 55199.2 MB/s
Sequential read (64-bit), size = 20 kB, loops = 14139216, 55228.2 MB/s
Sequential read (64-bit), size = 24 kB, loops = 11771760, 55178.0 MB/s
Sequential read (64-bit), size = 28 kB, loops = 10097100, 55212.2 MB/s
Sequential read (64-bit), size = 32 kB, loops = 8679424, 54246.3 MB/s
Sequential read (64-bit), size = 34 kB, loops = 7160732, 47543.7 MB/s
Sequential read (64-bit), size = 36 kB, loops = 6404580, 45029.4 MB/s
Sequential read (64-bit), size = 40 kB, loops = 5729724, 44762.0 MB/s
Sequential read (64-bit), size = 48 kB, loops = 4782960, 44837.4 MB/s
Sequential read (64-bit), size = 64 kB, loops = 3603456, 45042.9 MB/s
Sequential read (64-bit), size = 128 kB, loops = 1806848, 45168.2 MB/s
Sequential read (64-bit), size = 192 kB, loops = 1204753, 45175.8 MB/s
Sequential read (64-bit), size = 256 kB, loops = 897792, 44882.4 MB/s
Sequential read (64-bit), size = 320 kB, loops = 711144, 44435.3 MB/s
Sequential read (64-bit), size = 384 kB, loops = 590070, 44254.7 MB/s
Sequential read (64-bit), size = 512 kB, loops = 440064, 43995.8 MB/s
Sequential read (64-bit), size = 768 kB, loops = 285005, 42741.0 MB/s
Sequential read (64-bit), size = 1024 kB, loops = 170048, 34006.4 MB/s
Sequential read (64-bit), size = 1280 kB, loops = 120615, 30152.0 MB/s
Sequential read (64-bit), size = 1536 kB, loops = 91434, 27427.4 MB/s
Sequential read (64-bit), size = 1792 kB, loops = 77688, 27180.4 MB/s
Sequential read (64-bit), size = 2048 kB, loops = 64320, 25722.9 MB/s
Sequential read (64-bit), size = 2304 kB, loops = 56252, 25313.3 MB/s
Sequential read (64-bit), size = 2560 kB, loops = 49550, 24772.9 MB/s
Sequential read (64-bit), size = 2816 kB, loops = 47334, 26023.8 MB/s
Sequential read (64-bit), size = 3072 kB, loops = 41916, 25142.8 MB/s
Sequential read (64-bit), size = 3328 kB, loops = 37525, 24388.1 MB/s
Sequential read (64-bit), size = 3584 kB, loops = 35982, 25184.6 MB/s
Sequential read (64-bit), size = 4096 kB, loops = 31824, 25457.4 MB/s
Sequential read (64-bit), size = 5120 kB, loops = 25128, 25116.7 MB/s
Sequential read (64-bit), size = 6144 kB, loops = 22460, 26948.8 MB/s
Sequential read (64-bit), size = 7168 kB, loops = 18081, 25309.1 MB/s
Sequential read (64-bit), size = 8192 kB, loops = 14952, 23921.5 MB/s
Sequential read (64-bit), size = 9216 kB, loops = 13692, 24642.6 MB/s
Sequential read (64-bit), size = 10240 kB, loops = 12144, 24280.2 MB/s
Sequential read (64-bit), size = 12288 kB, loops = 9465, 22713.4 MB/s
Sequential read (64-bit), size = 14336 kB, loops = 7628, 21357.8 MB/s
Sequential read (64-bit), size = 15360 kB, loops = 6580, 19735.0 MB/s
Sequential read (64-bit), size = 16384 kB, loops = 6068, 19413.2 MB/s
Sequential read (64-bit), size = 20480 kB, loops = 3636, 14541.5 MB/s
Sequential read (64-bit), size = 21504 kB, loops = 3741, 15711.6 MB/s
Sequential read (64-bit), size = 32768 kB, loops = 1266, 8102.1 MB/s
Sequential read (64-bit), size = 49152 kB, loops = 900, 8640.0 MB/s
Sequential read (64-bit), size = 65536 kB, loops = 566, 7238.3 MB/s
Sequential read (64-bit), size = 73728 kB, loops = 609, 8765.8 MB/s
Sequential read (64-bit), size = 98304 kB, loops = 455, 8726.8 MB/s
Sequential read (64-bit), size = 131072 kB, loops = 331, 8461.2 MB/s

There is an interesting collection of commentaries at https://zsmith.co/bandwidth.php

Using Intel IMB-MPI1 to check fabrics and expected performance

In your .bashrc, source the Intel Parallel Studio environment scripts:

source /usr/local/intel_2015/parallel_studio_xe_2015/bin/psxevars.sh intel64
source /usr/local/intel_2015/impi/5.0.3.049/bin64/mpivars.sh intel64
source /usr/local/intel_2015/composerxe/bin/compilervars.sh intel64
source /usr/local/intel_2015/mkl/bin/mklvars.sh intel64
export MKLROOT=/usr/local/intel_2015/mkl
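
After opening a new shell (or re-sourcing .bashrc), a quick sanity check that the Intel environment has been picked up might look like this (paths will vary with your installation):

$ which mpirun mpiicc
$ echo $MKLROOT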

To run three workloads (pingpong, sendrecv, and exchange) with IMB-MPI1:

$ mpirun -r ssh -RDMA -n 512 -env I_MPI_DEBUG 5 IMB-MPI1 pingpong sendrecv exchange
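
If you need to pin the run to a particular fabric explicitly, Intel MPI 5.x also honours the I_MPI_FABRICS environment variable; the sketch below assumes a DAPL-capable InfiniBand fabric, so substitute the value that matches your interconnect (for example shm:ofa or shm:tcp):

$ mpirun -r ssh -n 512 -env I_MPI_FABRICS shm:dapl -env I_MPI_DEBUG 5 IMB-MPI1 pingpong sendrecv exchange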

 

Running Linpack (HPL) Test on Linux Cluster with OpenMPI and Intel Compilers

According to the HPL website:

HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark.

The algorithm used by HPL can be summarized by the following keywords: Two-dimensional block-cyclic data distribution – Right-looking variant of the LU factorization with row partial pivoting featuring multiple look-ahead depths – Recursive panel factorization with pivot search and column broadcast combined – Various virtual panel broadcast topologies – bandwidth reducing swap-broadcast algorithm – backward substitution with look-ahead of depth 1.

1. Requirements:

  1. MPI (1.1 compliant). For this entry, I’m using OpenMPI
  2. BLAS or VSIPL

2. For installing BLAS, LAPACK, and OpenMPI, do look at:

  1. Building BLAS Library using Intel and GNU Compiler
  2. Building LAPACK 3.4 with Intel and GNU Compiler
  3. Building OpenMPI with Intel Compilers
  4. Compiling ATLAS on CentOS 5

3. Download the latest HPL (hpl-2.1.tar.gz) from http://www.netlib.org

4. Copy the Make.Linux_PII_CBLAS file from $(HOME)/hpl-2.1/setup/ to $(HOME)/hpl-2.1/
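
Steps 3 and 4 as shell commands (the download URL below assumes the usual netlib layout; check the HPL page if the file has moved):

$ cd $HOME
$ wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
$ tar -zxvf hpl-2.1.tar.gz
$ cp hpl-2.1/setup/Make.Linux_PII_CBLAS hpl-2.1/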

5. Edit Make.Linux_PII_CBLAS file

# vim ~/hpl-2.1/Make.Linux_PII_CBLAS
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL        = /bin/sh
#
CD           = cd
CP           = cp
LN_S         = ln -s
MKDIR        = mkdir
RM           = /bin/rm -f
TOUCH        = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH         = Linux_PII_CBLAS
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir       = $(HOME)/hpl-2.1
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
#
HPLlib       = $(LIBdir)/libhpl.a

# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the  C  compiler where to find the Message Passing library
# header files,  MPlib  is defined  to be the name of  the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir        = /usr/local/mpi/intel
MPinc        = -I$(MPdir)/include
MPlib        = $(MPdir)/lib/libmpi.so
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the  C  compiler where to find the Linear Algebra  library
# header files,  LAlib  is defined  to be the name of  the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir        = /usr/local/atlas/lib
LAinc        =
LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
#
.....
.....
.....
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC           = /usr/local/mpi/intel/bin/mpicc
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms,  it is necessary  to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER       = /usr/local/mpi/intel/bin/mpicc
LINKFLAGS    = $(CCFLAGS)
#
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo
#
# ----------------------------------------------------------------------

6. Compile the HPL

# make arch=Linux_PII_CBLAS
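
Before running, tune the input file HPL.dat in bin/Linux_PII_CBLAS. The lines below are only the key entries of HPL.dat with illustrative values, not a complete file: N is the problem size (sized so the matrix fills most of the available memory), NB is the block size (typically 128 to 256), and the process grid P x Q must multiply out to the number of MPI processes passed to mpirun (here 4 x 4 = 16):

10000        Ns
192          NBs
4            Ps
4            Qs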

Running LINPACK on multiple nodes

$ cd ~/hpl-2.1/bin/Linux_PII_CBLAS
$ mpirun -np 16 --host node1,node2 ./xhpl

7. The output…..

.....
.....
.....
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00R2R4          35     4     4     1               0.00              4.019e-02
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0108762 ...... PASSED
================================================================================

Finished    864 tests with the following results:
864 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

Using iperf to measure the bandwidth and quality of a network

This writeup is adapted from the iPerf Tutorial by OpenManiak; for a more detailed and in-depth treatment, do read that tutorial. According to the iperf project site:

Iperf was developed by NLANR/DAST as a modern alternative for measuring maximum TCP and UDP bandwidth performance. Iperf allows the tuning of various parameters and UDP characteristics. Iperf reports bandwidth, delay jitter, and datagram loss.

Iperf can generate TCP and UDP traffic to perform the following kinds of tests:

  • Latency (response time or RTT): can be measured with the Ping utility.
  • Jitter: can be measured with an Iperf UDP test.
  • Datagram loss: can again be measured with an Iperf UDP test.
  • Bandwidth: measured using the Iperf TCP tests.

Iperf uses the distinct characteristics of TCP and UDP to provide statistics about network links. (TCP checks that packets are correctly delivered to the receiver; UDP is sent without any such checks.)

Iperf can be easily installed on a Linux box. After downloading the package, you can do the following:

# tar -zxvf iperf-2.0.5.tar.gz
# cd iperf-2.0.5
# ./configure
# make
# make install
# cd src

Iperf follows a client-server model. The server or the client can be Linux or Windows; since this blog focuses on Linux, both our server and client will be Linux.

Do note that the iperf client connects to the iperf server through port 5001, and the bandwidth measured is from the client to the server.

1. Uni-directional bandwidth measurement with data formatting (-f argument)

On the Client, we can use the following format:

  1. The -f argument displays the results in the desired format.
  2. The formatting parameters are bits (b), bytes (B), kilobits (k), kilobytes (K), megabits (m), megabytes (M), gigabits (g), or gigabytes (G).
# iperf -c 192.168.50.1 -f G

On the Server, we just use

# iperf -s

2. Bi-directional bandwidth measurement (-r parameter)

By default, the connection from the client to the server is measured. With the -r argument included, the iperf server re-connects back to the client, allowing bi-directional measurement.

On the Client Side

# iperf -c 192.168.50.1 -r -f G

On the Server Side

# iperf -s

3. Simultaneous bi-directional bandwidth measurement (-d argument)

On the Client Side

# iperf -c 192.168.50.1 -d -f G

On the Server Side

# iperf -s

4. Interval Settings (-t test duration, -i reporting interval)

On the Client Side, 

# iperf -c 192.168.50.1 -t 20 -i 1

On the Server Side

# iperf -s

5. UDP Settings (-u) and Bandwidth Settings (-b)

The UDP tests with the -u argument will give invaluable information about the jitter and the packet loss. If there is no -u parameter, iperf will default to TCP.

On the Client Side

# iperf -c 192.168.50.1 -u -b 10m

On the Server Side (-i reporting interval)

# iperf -s -u -i 2

6. Parallel tests (-P argument, number of parallel streams)

On Client side

# iperf -c 192.168.50.1 -P 4

On the Server Side

# iperf -s

Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part II)

This is a continuation of the article Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part I)

B. Running IMB

After “make”, the IMB-MPI1 executable should be in place. Run IMB-MPI1 pingpong from the management node or head node, ensuring the IMB-MPI1 binary is in the directory used below. Note that although many processes are launched, the PingPong benchmark itself runs on only two of them; the remaining ranks wait in MPI_Barrier, as the sample output shows.

# cd /home/hpc/imb/src
# mpirun -np 16 -host node1,node2 /home/hpc/imb/src/IMB-MPI1 pingpong
# mpirun -np 16 -host node1,node2 /home/hpc/imb/src/IMB-MPI1 sendrecv
# mpirun -np 16 -host node1,node2 /home/hpc/imb/src/IMB-MPI1 exchange

Example of output from “pingpong”

benchmarks to run pingpong
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.2.2, MPI-1 part
#---------------------------------------------------
# Date                  : Mon Feb  7 10:42:48 2011
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.18-164.el5
# Version               : #1 SMP Thu Sep 3 03:28:30 EDT 2009
# MPI Version           : 2.1
# MPI Thread Environment: MPI_THREAD_SINGLE

# New default behavior from Version 3.2 on:

# the number of iterations per message size is cut down
# dynamically when a certain run time (per message size sample)
# is expected to be exceeded. Time limit is defined by variable
# "SECS_PER_SAMPLE" (=> IMB_settings.h)
# or through the flag => -time

# Calling sequence was:

# /home/shared-rpm/imb/src/IMB-MPI1 pingpong

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# PingPong

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
# ( 46 additional processes waiting in MPI_Barrier)
#---------------------------------------------------
#bytes #repetitions      t[usec]   Mbytes/sec
0         1000         8.74         0.00
1         1000         8.82         0.11
2         1000         8.83         0.22
4         1000         8.89         0.43
8         1000         8.90         0.86
16         1000         8.99         1.70
32         1000         9.00         3.39
64         1000        10.32         5.91
128         1000        10.52        11.60
256         1000        11.24        21.72
512         1000        12.12        40.30
1024         1000        13.76        70.98
2048         1000        15.55       125.59
4096         1000        17.81       219.35
8192         1000        22.47       347.67
16384         1000        45.24       345.41
32768         1000        59.83       522.29
65536          640        87.68       712.85
131072          320       154.80       807.47
262144          160       312.87       799.05
524288           80       556.20       898.96
1048576           40      1078.94       926.84
2097152           20      2151.90       929.41
4194304           10      4256.70       939.69

# All processes entering MPI_Finalize

If you wish to use Torque to run the IMB, do read the IBM article “Setting up an HPC cluster with Red Hat Enterprise Linux”.
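
If you go that route, a minimal Torque submission script for the same pingpong run might look like the following sketch (the job name, node and core counts, and paths are assumptions for illustration); submit it with qsub:

#!/bin/bash
#PBS -N imb-pingpong
#PBS -l nodes=2:ppn=8
#PBS -j oe
cd $PBS_O_WORKDIR
mpirun -np 16 -machinefile $PBS_NODEFILE /home/hpc/imb/src/IMB-MPI1 pingpong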

Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part I)

This writeup focuses on verifying the performance of InfiniBand interconnects, as well as RDMA/iWARP interconnects. The material is adapted from the IBM article “Setting up an HPC cluster with Red Hat Enterprise Linux”.

A. Building Intel MPI Benchmark (“IMB”)

IMB can be run on a single node or on several nodes. Two or more nodes are required to test message passing between nodes.

Step 1: Download the IMB

1. Go to Intel® MPI Benchmarks 3.2.2 and download the software

2. Untar the package to a shared directory used by the nodes

# tar -zxvf IMB_3.2.2.tar.gz -C /home/hpc

3. Change directory to source directory

# cd /home/hpc/imb/src

4. Edit the make_ict makefile to change the assignment of the CC value from mpiicc to mpicc, as shown:

LIB_PATH    =
LIBS        =
CC          = mpicc
ifeq (,$(shell which ${CC}))
$(error ${CC} is not defined through the PATH environment variable setting. Please try sourcing an Intel(r) Cluster Tools script file such as "mpivars.[c]sh" or "ictvars.[c]sh")
endif
OPTFLAGS    =
CLINKER     = ${CC}
LDFLAGS     =
CPPFLAGS    =

export CC LIB_PATH LIBS OPTFLAGS CLINKER LDFLAGS CPPFLAGS
include Makefile.base

5. Type “make” at /home/hpc/imb/src

# make

You should see an IMB-MPI1 executable. If you cannot find it, do use the “locate” or “find” command to locate the executable.
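
For example (assuming the source tree sits under /home/hpc/imb as in step 2; adjust the path to your own layout):

# find /home/hpc/imb -name IMB-MPI1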

See Testing the Infiniband Interconnect Performance with Intel MPI Benchmark (Part II) for the second part of this article.