Tools to Show your System Configuration

Tool 1: Display Information about CPU Architecture

[user1@node1 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
Stepping: 4
CPU MHz: 3200.000
BogoMIPS: 6400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
.....
.....

Tool 2: List all PCI devices

[user1@node1 ~]# lspci -t -vv
-+-[0000:d7]-+-00.0-[d8]--
| +-01.0-[d9]--
| +-02.0-[da]--
| +-03.0-[db]--
| +-05.0 Intel Corporation Device 2034
| +-05.2 Intel Corporation Sky Lake-E RAS Configuration Registers
| +-05.4 Intel Corporation Device 2036
| +-0e.0 Intel Corporation Device 2058
| +-0e.1 Intel Corporation Device 2059
| +-0f.0 Intel Corporation Device 2058
| +-0f.1 Intel Corporation Device 2059
| +-10.0 Intel Corporation Device 2058
| +-10.1 Intel Corporation Device 2059
| +-12.0 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.1 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.2 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.4 Intel Corporation Sky Lake-E M3KTI Registers
| +-12.5 Intel Corporation Sky Lake-E M3KTI Registers
| +-15.0 Intel Corporation Sky Lake-E M2PCI Registers
| +-16.0 Intel Corporation Sky Lake-E M2PCI Registers
| +-16.4 Intel Corporation Sky Lake-E M2PCI Registers
| \-17.0 Intel Corporation Sky Lake-E M2PCI Registers
.....
.....

Tool 3: List all block devices

[user@node1 ~]# lsblk
NAME              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                 8:0    0  1.1T  0 disk
├─sda1              8:1    0  200M  0 part /boot/efi
├─sda2              8:2    0    1G  0 part /boot
└─sda3              8:3    0  1.1T  0 part
  ├─centos-root   253:0    0   50G  0 lvm  /
  ├─centos-swap   253:1    0    4G  0 lvm  [SWAP]
  └─centos-home   253:2    0    1T  0 lvm

Tool 4: See the flags the kernel booted with

[user@node1 ~]$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet LANG=en_US.UTF-8

Tool 5: Display available network interfaces

[root@hpc-gekko1 ~]# ifconfig -a
eno1: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
ether XX:XX:XX:XX:XX:XX txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
.....
.....
.....

Tool 6: Using dmidecode to find hardware information
See Using dmidecode to find hardware information
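
As a quick illustration (see the full write-up linked above for details), dmidecode reads the system's DMI/SMBIOS tables and must be run as root; the -t option restricts the output to a single hardware type:

# dmidecode -t system     (manufacturer, product name, serial number)
# dmidecode -t bios       (BIOS vendor, version and release date)
# dmidecode -t memory     (DIMM slots, installed sizes and speeds)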

Benchmarking Tools for Memory Bandwidth

What is Bandwidth

bandwidth is an artificial benchmark primarily for measuring memory bandwidth on x86 and x86_64 based computers. It is useful for identifying weaknesses in a computer’s memory subsystem, in the bus architecture, in the cache architecture and in the processor itself.

bandwidth also tests some libc functions and, under GNU/Linux, it attempts to test framebuffer memory access speed if the framebuffer device is available.

Prerequisites:

NASM, GNU Compiler Collection (GCC)

Compiling NASM

bandwidth-1.9.4 requires the latest version of NASM.

% tar -xvf nasm-2.15.05.tar.gz
% cd nasm-2.15.05
% ./configure
% make
% make install

You should now have the nasm binary. Make sure you update $PATH to include the directory containing nasm.
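
A minimal sketch, assuming NASM was installed with the default ./configure prefix of /usr/local (adjust the path if you installed it elsewhere):

% export PATH=/usr/local/bin:$PATH
% nasm -v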

Compiling bandwidth-1.9.4

% tar -zxvf bandwidth-1.9.4.tar.gz
% cd bandwidth-1.9.4
% make bandwidth64

You should now have the bandwidth64 binary.

Run the Test

% ./bandwidth64
Sequential read (64-bit), size = 256 B, loops = 1132462080, 55292.9 MB/s
Sequential read (64-bit), size = 384 B, loops = 765632322, 56075.0 MB/s
Sequential read (64-bit), size = 512 B, loops = 573833216, 56028.0 MB/s
Sequential read (64-bit), size = 640 B, loops = 457595948, 55857.6 MB/s
Sequential read (64-bit), size = 768 B, loops = 382990923, 56092.5 MB/s
Sequential read (64-bit), size = 896 B, loops = 326929770, 55865.7 MB/s
Sequential read (64-bit), size = 1024 B, loops = 285671424, 55789.1 MB/s
Sequential read (64-bit), size = 1280 B, loops = 229320072, 55973.6 MB/s
Sequential read (64-bit), size = 2 kB, loops = 143425536, 56016.5 MB/s
Sequential read (64-bit), size = 3 kB, loops = 95550030, 55977.6 MB/s
Sequential read (64-bit), size = 4 kB, loops = 71729152, 56036.7 MB/s
Sequential read (64-bit), size = 6 kB, loops = 47510700, 55667.7 MB/s
Sequential read (64-bit), size = 8 kB, loops = 35856384, 56020.1 MB/s
Sequential read (64-bit), size = 12 kB, loops = 23738967, 55631.2 MB/s
Sequential read (64-bit), size = 16 kB, loops = 17666048, 55199.2 MB/s
Sequential read (64-bit), size = 20 kB, loops = 14139216, 55228.2 MB/s
Sequential read (64-bit), size = 24 kB, loops = 11771760, 55178.0 MB/s
Sequential read (64-bit), size = 28 kB, loops = 10097100, 55212.2 MB/s
Sequential read (64-bit), size = 32 kB, loops = 8679424, 54246.3 MB/s
Sequential read (64-bit), size = 34 kB, loops = 7160732, 47543.7 MB/s
Sequential read (64-bit), size = 36 kB, loops = 6404580, 45029.4 MB/s
Sequential read (64-bit), size = 40 kB, loops = 5729724, 44762.0 MB/s
Sequential read (64-bit), size = 48 kB, loops = 4782960, 44837.4 MB/s
Sequential read (64-bit), size = 64 kB, loops = 3603456, 45042.9 MB/s
Sequential read (64-bit), size = 128 kB, loops = 1806848, 45168.2 MB/s
Sequential read (64-bit), size = 192 kB, loops = 1204753, 45175.8 MB/s
Sequential read (64-bit), size = 256 kB, loops = 897792, 44882.4 MB/s
Sequential read (64-bit), size = 320 kB, loops = 711144, 44435.3 MB/s
Sequential read (64-bit), size = 384 kB, loops = 590070, 44254.7 MB/s
Sequential read (64-bit), size = 512 kB, loops = 440064, 43995.8 MB/s
Sequential read (64-bit), size = 768 kB, loops = 285005, 42741.0 MB/s
Sequential read (64-bit), size = 1024 kB, loops = 170048, 34006.4 MB/s
Sequential read (64-bit), size = 1280 kB, loops = 120615, 30152.0 MB/s
Sequential read (64-bit), size = 1536 kB, loops = 91434, 27427.4 MB/s
Sequential read (64-bit), size = 1792 kB, loops = 77688, 27180.4 MB/s
Sequential read (64-bit), size = 2048 kB, loops = 64320, 25722.9 MB/s
Sequential read (64-bit), size = 2304 kB, loops = 56252, 25313.3 MB/s
Sequential read (64-bit), size = 2560 kB, loops = 49550, 24772.9 MB/s
Sequential read (64-bit), size = 2816 kB, loops = 47334, 26023.8 MB/s
Sequential read (64-bit), size = 3072 kB, loops = 41916, 25142.8 MB/s
Sequential read (64-bit), size = 3328 kB, loops = 37525, 24388.1 MB/s
Sequential read (64-bit), size = 3584 kB, loops = 35982, 25184.6 MB/s
Sequential read (64-bit), size = 4096 kB, loops = 31824, 25457.4 MB/s
Sequential read (64-bit), size = 5120 kB, loops = 25128, 25116.7 MB/s
Sequential read (64-bit), size = 6144 kB, loops = 22460, 26948.8 MB/s
Sequential read (64-bit), size = 7168 kB, loops = 18081, 25309.1 MB/s
Sequential read (64-bit), size = 8192 kB, loops = 14952, 23921.5 MB/s
Sequential read (64-bit), size = 9216 kB, loops = 13692, 24642.6 MB/s
Sequential read (64-bit), size = 10240 kB, loops = 12144, 24280.2 MB/s
Sequential read (64-bit), size = 12288 kB, loops = 9465, 22713.4 MB/s
Sequential read (64-bit), size = 14336 kB, loops = 7628, 21357.8 MB/s
Sequential read (64-bit), size = 15360 kB, loops = 6580, 19735.0 MB/s
Sequential read (64-bit), size = 16384 kB, loops = 6068, 19413.2 MB/s
Sequential read (64-bit), size = 20480 kB, loops = 3636, 14541.5 MB/s
Sequential read (64-bit), size = 21504 kB, loops = 3741, 15711.6 MB/s
Sequential read (64-bit), size = 32768 kB, loops = 1266, 8102.1 MB/s
Sequential read (64-bit), size = 49152 kB, loops = 900, 8640.0 MB/s
Sequential read (64-bit), size = 65536 kB, loops = 566, 7238.3 MB/s
Sequential read (64-bit), size = 73728 kB, loops = 609, 8765.8 MB/s
Sequential read (64-bit), size = 98304 kB, loops = 455, 8726.8 MB/s
Sequential read (64-bit), size = 131072 kB, loops = 331, 8461.2 MB/s

Note how the measured bandwidth drops in steps as the working set grows past each level of CPU cache; the plateaus roughly track the L1, L2 and L3 cache sizes. There is an interesting collection of commentaries at https://zsmith.co/bandwidth.php

Installing Julia Programming Language on CentOS 7

Julia is a high-level, high-performance dynamic language for technical computing. The main homepage for Julia can be found at julialang.org.

Step 1: The latest Julia binary can be downloaded from the JuliaLang downloads page:

% wget https://julialang-s3.julialang.org/bin/linux/x64/1.5/julia-1.5.3-linux-x86_64.tar.gz

Step 2: Extract the Julia Tar GZ

% tar -zxvf julia-1.5.3-linux-x86_64.tar.gz

Step 3: Move the unpacked folder to /usr/local

% mv julia-1.5.3 /usr/local

Try Julia
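
A minimal sketch of trying Julia, assuming the layout above (/usr/local/julia-1.5.3); add the bin directory to $PATH, or symlink the julia binary into a directory already on your $PATH:

% export PATH=/usr/local/julia-1.5.3/bin:$PATH
% julia --version
julia version 1.5.3
% julia
julia> 1 + 1
2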

Compiling Quantum ESPRESSO-6.7.0 with BEEF and Intel MPI 2018 on CentOS 7

Step 1: Download Quantum ESPRESSO 6.7.0 from the Quantum ESPRESSO v. 6.7 release page and unpack it

% tar -zxvf q-e-qe-6.7.0.tar.gz

Step 2: Remember to source the Intel compiler, MKL and MPI environment scripts in your .bashrc so that the compilers, MKLROOT and the MPI wrappers are available

source /usr/local/intel/2018u3/mkl/bin/mklvars.sh intel64
source /usr/local/intel/2018u3/parallel_studio_xe_2018/bin/psxevars.sh intel64
source /usr/local/intel/2018u3/compilers_and_libraries/linux/bin/compilervars.sh intel64
source /usr/local/intel/2018u3/impi/2018.3.222/bin64/mpivars.sh intel64
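
A quick sanity check after sourcing the scripts; mklvars.sh exports MKLROOT, and the Intel MPI wrappers should now be on your $PATH:

% echo $MKLROOT
% which mpiifort
% which mpiicc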

Step 3a: Install libbeef

  1. Installing libbeef with Intel Compilers

Step 3b: Make a file called setup.sh, copy the contents below into it, and make it executable (chmod +x setup.sh)

export F90=mpiifort
export F77=mpiifort
export MPIF90=mpiifort
export CC=mpiicc
export CPP="icc -E"
export CFLAGS=$FCFLAGS
export AR=xiar
#export LDFLAGS="-L/usr/local/intel/2018u3/mkl/lib/intel64"
#export CPPFLAGS="-I/usr/local/intel/2018u3/mkl/include/fftw"
export BEEF_LIBS="-L/usr/local/libbeef-0.1.3/lib -lbeef"
export BLAS_LIBS="-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core"
export LAPACK_LIBS="-lmkl_blacs_intelmpi_lp64"
export SCALAPACK_LIBS="-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64"
#export FFT_LIBS="-lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_scalapack_lp64"
./configure --enable-parallel --enable-openmp --enable-shared --with-scalapack=intel --prefix=/usr/local/espresso-6.7.0 | tee Configure.out
% ./setup.sh
% make all
% make install

(I prefer plain make all over make all -j 16 so that any errors are easy to see and trap.)

If successful, you will see the Quantum ESPRESSO executables (pw.x, cp.x, ph.x, etc.) in your bin folder.
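
A quick check, assuming the layout above: make all places symlinks to the compiled executables in the bin/ directory of the source tree, and make install copies them to the --prefix given to configure:

% ls bin/ | head          (pw.x, cp.x, ph.x, etc.)
% ls /usr/local/espresso-6.7.0/bin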

Using qperf to measure network bandwidth and latency

qperf is a network bandwidth and latency measurement tool which works over many transports, including TCP/IP, RDMA, UDP and SCTP.

Step 1: Installing qperf on both Server and Client

Server> # yum install qperf
Client> # yum install qperf

Step 2: Open Up the Firewall on the Server

# firewall-cmd --permanent --add-port=19765-19766/tcp
# firewall-cmd --reload

The qperf server listens on TCP port 19765 by default; this is the control port. Once a test starts, qperf also opens a separate data port for the actual test traffic, so here we pin the data port to 19766 with the -ip option and open both ports in the firewall.

Step 3: Have the Server listen (as the qperf server)

Server> $ qperf

Step 4: Connect to qperf Server with qperf Client and measure bandwidth

Client > $ qperf -ip 19766 -t 60 qperf_server_ip_address tcp_bw
tcp_bw:
    bw  =  2.52 GB/sec

Step 5: Connect to qperf Server with qperf Client and measure latency

Client > $ qperf -vvs qperf_server_ip_address tcp_lat
tcp_lat:
    latency         =  20.7 us
    msg_rate        =  48.1 K/sec
    loc_send_bytes  =  48.2 KB
    loc_recv_bytes  =  48.2 KB
    loc_send_msgs   =  48,196
    loc_recv_msgs   =  48,196
    rem_send_bytes  =  48.2 KB
    rem_recv_bytes  =  48.2 KB
    rem_send_msgs   =  48,197
    rem_recv_msgs   =  48,197
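
When the measurements are done, the qperf server can be shut down from the client side; qperf provides a quit "test" for exactly this (alternatively, just Ctrl-C the server process):

Client > $ qperf qperf_server_ip_address quit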

References and material taken from:

  1. How to use qperf to measure network bandwidth and latency performance?

How to configure NFS on CentOS 7

Step 1: Do a Yum Install

# yum install nfs-utils rpcbind

Step 2: Enable the Service at Boot Time

# systemctl enable nfs-server
# systemctl enable rpcbind
# systemctl enable nfs-lock     (it does not need to be enabled since rpc-statd.service  is static.)
# systemctl enable nfs-idmap    (it does not need to be enabled since nfs-idmapd.service is static.)

Step 3: Start the Services

# systemctl start rpcbind
# systemctl start nfs-server
# systemctl start nfs-lock
# systemctl start nfs-idmap

Step 4: Confirm the status of NFS

# systemctl status nfs

Step 5: Create a mount point

# mkdir /shared-data

Step 6: Add the Share to /etc/exports

# vim /etc/exports
/shared-data 192.168.0.0/16(rw,no_root_squash)

Step 7: Export the Share

# exportfs -rv
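
To verify that the share is now being exported, you can run showmount (part of nfs-utils) on the server:

# showmount -e localhost
Export list for localhost:
/shared-data 192.168.0.0/16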

Step 8: Restart the NFS Services

# systemctl restart nfs-server

Step 9: Configure the Firewall

# firewall-cmd --add-service=nfs --zone=internal --permanent
# firewall-cmd --add-service=mountd --zone=internal --permanent
# firewall-cmd --add-service=rpc-bind --zone=internal --permanent
# firewall-cmd --reload         (the --permanent rules only take effect after a reload)
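
On an NFS client, a minimal sketch of mounting the share; 192.168.0.10 is a hypothetical address for the NFS server, so replace it with your server's IP or hostname:

# yum install nfs-utils
# mkdir /mnt/shared-data
# mount -t nfs 192.168.0.10:/shared-data /mnt/shared-data
# df -h /mnt/shared-data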

References:

  1. How to configure NFS in RHEL 7
  2. What firewalld services should be active on an NFS server in RHEL 7?

Performance Required for Deep Learning

There is a question I wanted to answer about deep learning: which system, network and protocol features speed up training and/or inferencing? It may not be necessary to apply the same level of requirements to training as to inferencing, and vice versa. I received the information below during an NVIDIA presentation.

Training:

  1. Scalability requires ultra-fast networking
  2. Same hardware needs as HPC
  3. Extreme network bandwidth
  4. RDMA
  5. SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
  6. GPUDirect (https://developer.nvidia.com/gpudirect)
  7. Fast Access Storage

Inferencing:

  1. Highly Transactional
  2. Ultra-low Latency
  3. Instant Network Response
  4. RDMA
  5. PeerDirect, GPUDirect