Installing and using Mellanox HPC-X Software Toolkit

Overview

Taken from Mellanox HPC-X Software Toolkit User Manual 2.3

Mellanox HPC-X is a comprehensive software package that includes MPI and SHMEM communication libraries. HPC-X includes various acceleration packages to improve both the performance and scalability of applications running on top of these libraries, including UCX (Unified Communication X) and MXM (Mellanox Messaging), which accelerate the underlying send/receive (or put/get) messages. It also includes FCA (Fabric Collectives Accelerations), which accelerates the underlying collective operations used by the MPI/PGAS languages.

Download

https://www.mellanox.com/products/hpc-x-toolkit

Installation

% tar -xvf hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.6-x86_64.tbz
% cd hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.6-x86_64
% export HPCX_HOME=/usr/local/hpcx-v2.6.0-gcc-MLNX_OFED_LINUX-5.0-1.0.0.0-redhat7.6-x86_64

Loading HPC-X Environment from BASH

HPC-X includes Open MPI v4.0.x. Each Open MPI version has its own module file which can be used to load the desired version

% source $HPCX_HOME/hpcx-init.sh
% hpcx_load
% env | grep HPCX
% mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
% oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/
% hello_oshmem_c
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% hpcx_unload

Loading HPC-X Environment from Modules

You can use the already built module files in hpcx.

% module use $HPCX_HOME/modulefiles
% module load hpcx
% mpicc $HPCX_MPI_TESTS_DIR/examples/hello_c.c -o $HPCX_MPI_TESTS_DIR/examples/hello_c
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
% oshcc $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c.c -o $HPCX_MPI_TESTS_DIR/examples/
hello_oshmem_c
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem_c
% module unload hpcx

Building HPC-X with the Intel Compiler Suite

Do take a look at the Mellanox HPC-X® ScalableHPC Software Toolkit

References:

  1. Mellanox HPC-X Software Toolkit User Manual 2.3
  2. Mellanox HPC-X® ScalableHPC Software Toolkit

Lenovo new 4-Socket Servers SR860-V2 and SR850-V2

This month, Lenovo launched 2 new mission-critical servers based on new 4-socket-capable third-generation Intel Xeon Scalable processors.

  • ThinkSystem SR860 V2, the new 4U 4-socket server, supporting up to 48x 2.5-inch drive bays and up to 8x NVIDIA T4 GPUs or 4x NVIDIA V100S GPUs.

  • ThinkSystem SR850 V2, the new 2U 4-socket server, supported up to 24x 2.5-inch drive bays, all of which can be NVMe if desired.

References:

 

General Linux OS Tuning for AMD EPYC

Step 1: Turn off swap
Turn off swap to prevent accidental swapping. Do not that disabling swap without sufficient memory can have undesired effects

swapoff -a

Step 2: Turn off NUMA balancing
NUMA balancing can have undesired effects and since it is possible to bind the ranks and memory in HPC, this setting is not needed

echo 0 > /proc/sys/kernel/numa_balancing

Step 3: Disable ASLR (Address Space Layout Ranomization) is a security feature used to prevent the exploitation of memory vulnerabilities

echo 0 > /proc/sys/kernel/randomize_va_space

Step 4: Set CPU governor to performance and disable cc6. Setting the CPU perfomance to governor to perfomrnaces ensures max performances at all times. Disabling cc6 ensures that deeper CPU sleep states are not entered.

cpupower frequency-set -g performance
Setting cpu: 0
Setting cpu: 1
.....
.....
cpupower idle-set -d 2
Idlestate 2 disabled on CPU 0
Idlestate 2 disabled on CPU 1
Idlestate 2 disabled on CPU 2
.....
.....

References:

  1. Tuning Guard for AMD EPYC (pdf)

Getting Useful Information on CPU and Configuration

Point 1. lscpu

To install

yum install util-linux

lscpu – (Print out information about CPU and its configuration)

[user1@myheadnode1 ~]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
Stepping: 4
CPU MHz: 3200.000
BogoMIPS: 6400.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 25344K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu .................

Point 2: hwloc-ls

To install hwloc-ls

yum install hwloc

hwloc – (Prints out useful information about the NUMA locality of devices and general hardware locality information)

[user1@myheadnode1 ~]# hwloc-ls
Machine (544GB total)
NUMANode L#0 (P#0 256GB)
Package L#0 + L3 L#0 (25MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#16)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#17)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#18)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#19)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#20)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#21)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#22)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#23)
.....
.....
.....

Point 3 – Check whether the Boost is on for AMD

Print out if CPU boost is on or off

cat /sys/devices/system/cpu/cpufreq/boost
1

References:

  1. Tuning Guard for AMD EPYC (pdf)

BOIS settings for OEM Server with EPYC

Taken from Chapter 4 of https://developer.amd.com/wp-content/resources/56827-1-0.pdf

 

Selected Explanation of Setting. (See Document for FULL explanation)

1. Simultaneous Mult-Threading (SMT) or HyperThreading (HT)

  • IN HPC Workload, the SMT are usually turned off

2. x2APIC

  • This option helps the operating system deal with interrupt more efficiently in high cores count configuration. It is recommended to enable this option. This option must be enabled if  using more than 255 threads

3. Numa Per Socket (NPS)

  • In many HPC applications, ranks and memory can be pinned to cores and NUMA Nodes. The recommended value should be NPS4 option. However, if the workload is not NUMA aware or suffers when the NUMA complexity increase, we can experiment with NSP1.

4. Memory Frequency, Infinity Fabric Frequency, and coupled ve uncoupled mode

Memory Clock and Infinity Fabric Clock can run at synchronous frequencies (coupled mode) or at asynchronous frequencies (uncoupled mode)

  • If the memory is clocked at lower than 2933 MT/s, the memory and fabric will run in coupled mode which has the lowest memory latency
  • If the memory is clocked at  3200 MT/s, the memory and fabric clock will run in asynchronous mode has higher bandwidth but increased memory latency.
  • Make sure APBDIS is set to 1 and fixed SOC Pstate is set to P0

5. Preferred IO

Preferred IO allows one PCIe device in the system to be configured in a preferred mode. This device gets preferential treant on the infinity fabric

6. Determinism Slider

  • Recommended to choose Power Option. For this mode, the CPUs in the system performance at the maximum capability of each silicon device. Due to the natural variation existing during the manufacturing process, some CPUs performances may be varied,  but will never fall below “Performance Determinism mode”

 

Test Failed… does not support .pth files

If you are doing a setup.py with specific directories

python setup.py install --prefix=/home/user1

If you are getting a PythonPath Error something like this.

TEST FAILED: /home/user1/lib/python3.7/site-packages/ does NOT support .pth files error: bad install directory or PYTHONPATH

You are attempting to install a package to a directory that is not on PYTHONPATH and which Python does not read ".pth" files from. 
The installation directory you specified (via --install-dir, --prefix, or the distutils default setting) was: /home/user1/lib/python3.7/site-packages/

and your PYTHONPATH environment variable currently contains:

You can solve like by putting in your .bashrc.

export PYTHONPATH="${PYTHONPATH}:/home/user1/lib/python3.7/site-packages/"
source ~/.bashrc

Installing SCons-3.1.2 with Intel Python Distribution

What is SCons?

SCons is an Open Source software construction tool—that is, a next-generation build tool. Think of SCons as an improved, cross-platform substitute for the classic Make utility with integrated functionality similar to autoconf/automake and compiler caches such as ccache. In short, SCons is an easier, more reliable and faster way to build software.

For more information, see https://scons.org/

Prerequisites

Python 3 Distribution. For this I used the Intel Python Distribution

Get the Source Code

git clone https://github.com/SCons/scons.git

Setup scons directory and run the setup script

cd $HOME/scons
/usr/local/intel/2020/intelpython3/bin/python setup.py install

Do note that the scons will write a site package at /usr/local/intel/2020/intelpython3/lib/python3.7/site-packages/SCons-3.9.9a993-py3.7.egg . You need to allow the necessary permission

Testing the package

cd /usr/local/scons
python runtest.py SCons/BuilderTests.py
1/1 (100.00%) /usr/local/intel/2020/intelpython3//bin/python SCons/BuilderTests.py
......................................
----------------------------------------------------------------------
Ran 38 tests in 0.096s
OK

References:

  1. SCons: A software construction tool
  2. SCons GitHub Site

 

Turning ksm and ksmtuned off

In this blog, I will write on how to turn off KSM and ksmtuned since I do not need these services and save some unnecessary swapping activities on the disk.

What is KSM?

According to RedHat Site (8.4. KERNEL SAME-PAGE MERGING (KSM)),
Kernel same-page Merging (KSM), used by the KVM hypervisor, allows KVM guests to share identical memory pages. These shared pages are usually common libraries or other identical, high-use data. KSM allows for greater guest density of identical or similar guest operating systems by avoiding memory duplication……

KSM is a Linux feature which uses this concept in reverse. KSM enables the kernel to examine two or more already running programs and compare their memory. If any memory regions or pages are identical, KSM reduces multiple identical memory pages to a single page……

8.4.4 Kernel same-page merging (KSM) has a performance overhead which may be too large for certain environments or host systems. KSM may also introduce side channels that could be potentially used to leak information across guests. If this is a concern, KSM can be disabled on per-guest basis.

Deactivating KSM

# systemctl stop ksmtuned
Stopping ksmtuned:                                         [  OK  ]
# systemctl stop ksm
Stopping ksm:                                              [  OK  ]

To permanently deactivate KSM with the systemctl commands

# systemctl disable ksm
# systemctl disable ksmtuned

When KSM is disabled, any memory pages that were shared prior to deactivating KSM are still shared. To delete all of the PageKSM in the system, use the following command:

# echo 2 >/sys/kernel/mm/ksm/run

After this is performed, the khugepaged daemon can rebuild transparent hugepages on the KVM guest physical memory. Using # echo 0 >/sys/kernel/mm/ksm/run stops KSM, but does not unshare all the previously created KSM pages (this is the same as the # systemctl stop ksmtuned command).

References:

  1. Redhat – 8.4. Kernel Same-Page Merging (KSM)

Checking the Limits an application is imposed during run

If you wish to look at a specific application limits during run, you can do the following

pgrep fortcom
12345

* I used for fortcom, but it could be any application you wish to take a look.

cat /proc/12345/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 4096 2190327 processes
Max open files 1024 4096 files
Max locked memory unlimited unlimited bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 2190327 2190327 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us

* You can take a look that there is no limits to Max locked Memory and Max file locks are unlimited.