Retrieving OpenMPI Configuration

If you need to find out the configuration settings used to build your OpenMPI installation, you may want to use the command below

$ ./ompi_info -all|grep 'command line'
 Configure command line: '--prefix=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi' '--with-libevent=internal' '--enable-mpi1-compatibility' '--without-xpmem' '--with-cuda=/hpc/local/oss/cuda12.1.1' '--with-slurm' '--with-platform=contrib/platform/mellanox/optimized' '--with-hcoll=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/hcoll' '--with-ucx=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucx' '--with-ucc=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucc'

If you wish to look at the full configuration

$ ./ompi_info
Package: Open MPI root@hpc-kernel-03 Distribution
Open MPI: 4.1.5rc2
Open MPI repo revision: v4.1.5rc1-17-gdb10576f40
Open MPI release date: Unreleased developer copy
Open RTE: 4.1.5rc2
Open RTE repo revision: v4.1.5rc1-17-gdb10576f40
Open RTE release date: Unreleased developer copy
OPAL: 4.1.5rc2
OPAL repo revision: v4.1.5rc1-17-gdb10576f40
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.1.5rc2
Prefix: /usr/local/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi
.....
.....
.....

Compiling OpenMPI-4.1.5 for RoCEv2 with GNU-8.5

https://docs.open-mpi.org/en/v5.0.x/release-notes/networks.html

Prerequisite 1

First things first, you may want to check whether you are using RoCE. Do take a look at Installing RoCE using the Mellanox (NVIDIA) OFED package.
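
A quick way to confirm that a port is running RoCE rather than native InfiniBand is to check the link layer reported by the verbs stack. A minimal sketch (the device name mlx5_0 and the output below are illustrative and will vary with your hardware):

$ ibv_devinfo | grep -E 'hca_id|link_layer'
hca_id: mlx5_0
        link_layer:             Ethernet

A port reporting "link_layer: Ethernet" is a RoCE port; "InfiniBand" indicates a native IB port.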

Prerequisite 2

Do check whether you have UCX installed. You can do a dnf install

# dnf install ucx ucx-devel

Alternatively, you can do a manual install. For information on how to install, do take a look at http://openucx.org/wp-content/uploads/UCX_install_guide.pdf

$ wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
$ tar xzf ucx-1.4.0.tar.gz
$ cd ucx-1.4.0
$ ./contrib/configure-release --prefix=/usr/local/ucx-1.4.0
$ make -j8 
$ make install
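
After the build completes, you can sanity-check the installation and see which transports UCX detects; a minimal sketch (the transports listed in the output depend on your hardware and OFED stack):

$ /usr/local/ucx-1.4.0/bin/ucx_info -v
$ /usr/local/ucx-1.4.0/bin/ucx_info -d | grep Transport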

Prerequisite 3

Make sure you have installed the GNU C and C++ compilers. This can be done easily using dnf

# dnf install gcc-c++ gcc

Step 1: Download the OpenMPI package

You can go to the OpenMPI download page (https://www.open-mpi.org/software/ompi/v4.1/) to get the latest package. The latest one in this series at the point of writing is OpenMPI-4.1.5.
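
For example, to fetch and unpack the 4.1.5 tarball (the URL follows the usual OpenMPI release layout; adjust the version to match what you downloaded):

$ wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.5.tar.gz
$ tar -xzf openmpi-4.1.5.tar.gz
$ cd openmpi-4.1.5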

Step 2: Compile the Package

$ ./configure --prefix=/usr/local/openmpi-4.1.5 --enable-mpi-cxx --with-devel-headers --with-ucx --with-verbs --with-slurm=no
$ make && make install
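
Once installed, a quick way to confirm that the build picked up UCX support is to query the new ompi_info; a minimal sketch assuming the --prefix used above:

$ export PATH=/usr/local/openmpi-4.1.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/openmpi-4.1.5/lib:$LD_LIBRARY_PATH
$ ompi_info | grep -i ucx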

Step 3: To run mpirun over RoCE, do the following.

You may want to see Network Support Information on OpenMPI

$ mpirun --np 12 --hostfile path/to/hostfile --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ........
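
As a fuller sketch, assuming a hypothetical two-node hostfile and a hypothetical benchmark binary ./osu_bw (replace mlx5_0:1 with the device and port reported by ibv_devinfo on your nodes):

$ cat hostfile
node01 slots=6
node02 slots=6
$ mpirun -np 12 --hostfile hostfile \
    --mca pml ucx \
    -x UCX_NET_DEVICES=mlx5_0:1 \
    ./osu_bw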

References:

  1. Setting up a RoCE cluster
  2. OpenMPI – Network Support
  3. How do I run Open MPI over RoCE? (UCX PML)

Compiling OpenMPI-3.1.6 with GCC-6.5

We assume that you have installed GCC 6.5 and isl-0.15.

Download the OpenMPI 3.1.6 package from the OpenMPI site.

% ./configure --prefix=/usr/local/gnu/openmpi-3.1.6 --enable-orterun-prefix-by-default --enable-mpi-cxx --enable-openib-rdmacm-ibaddr --enable-mca-no-build=btl-uct

--enable-orterun-prefix-by-default (so that you do not need to pass the --prefix option to orterun/mpirun)
--enable-openib-rdmacm-ibaddr (to enable routing over IB)
--enable-mpi-cxx (C++ bindings are no longer built by default)
--enable-mca-no-build=btl-uct (recent OpenMPI versions contain a BTL component called 'uct', which can cause data corruption when enabled, due to a conflict on malloc hooks between OPAL and UCM)

% make all install | tee install.log
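
After installation, you can check which BTL components were built into the new OpenMPI; a minimal sketch assuming the --prefix above (openib should appear if the verbs headers were found, and uct should be absent because of --enable-mca-no-build=btl-uct):

% /usr/local/gnu/openmpi-3.1.6/bin/ompi_info | grep 'MCA btl'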

References:

  1. Intel Community – Caught Signal 11 (Segmentation Fault: Does not mapped to object at)
  2. Open MPI + Scalasca: Cannot run mpirun command with option --prefix?

Compiling OpenMPI-1.8.8 with Intel Compiler and CUDA

The configuration parameters for compiling OpenMPI-1.8.8 with the Intel Compiler and CUDA are shown below.

# ./configure --prefix=/usr/local/openmpi-1.8.8-gpu_intel-15.0.7 CC=icc CXX=icpc F77=ifort FC=ifort --with-devel-headers --enable-binaries --with-cuda=/usr/local/cuda/
# make -j 16
# make install
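
To confirm that the resulting build is CUDA-aware, you can query the build-time flag through ompi_info; the path below assumes the --prefix used above, and the output should end in "true":

# /usr/local/openmpi-1.8.8-gpu_intel-15.0.7/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true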

References:

  1. Compiling OpenMPI 1.6.5 with Intel 12.1.5 on CentOS 6
  2. Building OpenMPI Libraries for 64-bit integers

Debugging Tools to track run-time errors for mpirun

If you are having unexplained issues with mpirun, there are various methods you can use to troubleshoot.

Information on "--mca orte_base_help_aggregate 0"

If your mpirun dies without any error messages, you may want to read the OpenMPI FAQ entry "Debugging applications in parallel: 7. My process dies without any output. Why?", which is quoted below.

If your application fails due to memory corruption, Open MPI may subsequently fail to output an error message before dying. Specifically, starting with v1.3, Open MPI attempts to aggregate error messages from multiple processes in an attempt to show unique error messages only once (vs. one for each MPI process, which can be unwieldy, especially when running large MPI jobs).

However, this aggregation process requires allocating memory in the MPI process when it displays the error message. If the process' memory is already corrupted, Open MPI's attempt to allocate memory may fail and the process will simply die, possibly silently. When Open MPI does not attempt to aggregate error messages, most of its setup work is done during MPI_INIT and no memory is allocated during the "print the error" routine. It therefore almost always successfully outputs error messages in real time, but at the expense that you will potentially see the same error message for each MPI process that encountered the error.

Hence, the error message aggregation is usually a good thing, but sometimes it can mask a real error. You can disable Open MPI’s error message aggregation with the orte_base_help_aggregate MCA parameter. For example:

 $ mpirun --mca orte_base_help_aggregate 0 ...
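
You can also combine it with other MCA verbosity parameters to see more of what the runtime is doing; a minimal sketch (the verbosity level and the application name ./a.out are illustrative):

$ mpirun --mca orte_base_help_aggregate 0 \
    --mca btl_base_verbose 30 \
    -np 4 ./a.out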

Compiling and installing FFTW 3.3.3 with OpenMPI and OpenMP

This blog entry is an extension of Compiling and installing FFTW 3.3.3.

To compile FFTW 3.3.3 in single precision with OpenMPI, make sure you have compiled and set your paths for the Intel compilers and OpenMPI. You may want to get more information from Compiling OpenMPI 1.6.5 with Intel 12.1.5 on CentOS 6.

Step 1: Compiling FFTW 3.3.3 (Single Precision) with OpenMPI and OpenMP

After unpacking FFTW 3.3.3, you may want to use the following flags

# ./configure CC=icc \
--enable-float --enable-threads --enable-openmp \
--enable-mpi MPICC=mpicc \
LDFLAGS=-L/usr/local/openmpi/intel/lib CPPFLAGS=-I/usr/local/openmpi/intel/include \
--prefix=/usr/local/fftw-3.3.3-single
# make -j8
# make install

Inside /usr/local/fftw-3.3.3-single/lib, you should see at least the files below

libfftw3f.a
libfftw3f_mpi.a
libfftw3f_omp.a
libfftw3f_threads.a
....
....

Step 2: Compiling FFTW 3.3.3 (Double Precision) with OpenMPI and OpenMP

# ./configure CC=icc \
--enable-threads --enable-openmp \
--enable-mpi MPICC=mpicc \
LDFLAGS=-L/usr/local/openmpi/intel/lib CPPFLAGS=-I/usr/local/openmpi/intel/include \
--prefix=/usr/local/fftw-3.3.3-double
# make -j8
# make install

Inside /usr/local/fftw-3.3.3-double/lib, you should see at least the files below

libfftw3.a
libfftw3_mpi.a
libfftw3_omp.a
libfftw3_threads.a
....
....
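
To check that the MPI-enabled libraries link cleanly, a minimal sketch: compile a hypothetical MPI source file my_fftw_test.c against one of the builds, using -lfftw3f_mpi -lfftw3f for single precision or -lfftw3_mpi -lfftw3 for double precision:

$ mpicc my_fftw_test.c -o my_fftw_test \
    -I/usr/local/fftw-3.3.3-single/include \
    -L/usr/local/fftw-3.3.3-single/lib \
    -lfftw3f_mpi -lfftw3f -lm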

Compiling Open Tool for Parameter Optimization (otpo)

Taken from Open Tool for Parameter Optimization (otpo) Documentation

OTPO (Open Tool for Parameter Optimizations) is an Open MPI specific tool that is meant to explore the MCA parameter space. In Open MPI, the user can specify at runtime many values for MCA parameters, try e.g. (ompi_info --param all all). Alternatively, to focus on a single aspect of parameters, e.g. the parameters of the OpenIB BTL (ompi_info --param btl openib).

OTPO is a tool that takes in a list of any MCA parameters, with a user specified range of values for those parameters, and for every combination of the MCA parameter values, OTPO executes an MPI job, measuring execution time (the only measurement available right now), bandwidth, etc. The tests used for the measurements are modular. Right now, OTPO supports

Netpipe
Skampi
NAS Parallel Benchmarks

OTPO outputs a list of the best parameter combinations for a certain test.

The main purpose of OTPO is to explore the effect of the MCA parameters on different machines with different architectures and configurations, and explore the dependencies between the MCA parameters themselves. OTPO is meant to run on the head node of a cluster, and it forks MPI jobs after exporting the current combination of MCA parameters on the nodes.

OTPO is built on top of ADCL (Abstract Data and Communication Library). ADCL is an application level communication library aiming at providing the highest possible performance for application level communication operations in a given execution environment. OTPO uses ADCL to provide the runtime selection logic and choosing the best combination of parameters.


A. Compiling OTPO

Step 1: Make sure your PATH and LD_LIBRARY_PATH have the necessary paths for OpenMPI and the compilers

In your .bashrc, it should look something like

export PATH=$PATH:/usr/local/intel/composerxe/bin:/usr/local/openmpi-1.6.5-intel-v12.1.5/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/mpc-0.8.1/lib:/usr/local/intel/composerxe/lib/intel64:/usr/local/intel/composerxe/mkl/lib/intel64:/usr/local/openmpi-1.6.5-intel-v12.1.5/lib

Step 2: Download and compile ADCL within OTPO

# wget http://www.open-mpi.org/software/otpo/v1.0/downloads/otpo-1.0.tar.gz
# tar -zxvf otpo-1.0.tar.gz
# cd otpo
# cd ADCL
# ./configure --prefix=/usr/local/ADCL
# make
# make install
# mkdir -p /usr/local/ADCL/include
# cp /root/otpo/ADCL/include/*.h /usr/local/ADCL/include/

After the compilation, you should see libadcl.a inside the lib folder of ADCL. Remember to copy the ADCL headers to /usr/local/ADCL/include, as shown above.

Step 3: Compile OTPO

# ./autogen.sh
# ./configure --prefix=/usr/local/otpo-1.0 --with-adcl-dir=/usr/local/ADCL
# make
# make install

Compiling OpenMPI 1.6.5 with Intel 12.1.5 on CentOS 6

Step 1: Download the OpenMPI software from http://www.open-mpi.org/. The current stable version at the point of writing is OpenMPI 1.6.5.

Step 2: Download and Install the Intel Compilers from Intel Website. More information can be taken from Free Non-Commercial Intel Compiler Download

Step 3: Add the Intel Directory Binary Path to the Bash Startup

In my ~/.bash_profile, I've added
source /usr/local/intel/composerxe/bin/compilervars.sh intel64
export PATH=$PATH:/usr/local/intel/composerxe/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/RH6_apps/intel/composerxe/lib/intel64:/usr/local/intel/composerxe/mkl/lib/intel64

At command prompt

# source ~/.bash_profile

Step 4: Configuration Information

# gunzip -c openmpi-1.6.5.tar.gz | tar xf -
# cd openmpi-1.6.5
# ./configure --prefix=/usr/local/openmpi-1.6.5-intel-v12.1.5 CC=icc CXX=icpc F77=ifort FC=ifort --with-devel-headers --enable-binaries
# make all install

You will need to include "--with-devel-headers --enable-binaries" or you will face the error "Cannot open OpenMPI Configuration mpif90-wrapper-data.txt".

Step 5: Test the configuration

$ mpicc -v
icc version 12.1.5 (gcc version 4.4.6 compatibility)

Building OpenMPI Libraries for 64-bit integers

There is an excellent article on how to build OpenMPI Libraries for 64-bit integers. For more detailed information, do look at How to build MPI libraries for 64-bit integers

The information on this website is taken from the above site.

Step 1: Check the Fortran integer size by doing the following. If the output is as below, you have to compile OpenMPI with 64-bit integers.

# ompi_info -a | grep 'Fort integer size'
Fort integer size: 4

* Intel Compilers

Step 2a: To compile OpenMPI with Intel Compilers and with 64-bit integers, do the following:

# ./configure --prefix=/usr/local/openmpi CXX=icpc CC=icc \
F77=ifort FC=ifort FFLAGS=-i8 FCFLAGS=-i8
# make -j 8
# make install

* GNU Compilers

Step 2b: To compile OpenMPI with GNU Compilers and with 64-bit integers, do the following:

# ./configure --prefix=/usr/local/openmpi CXX=g++ CC=gcc F77=gfortran FC=gfortran \
FFLAGS="-m64 -fdefault-integer-8"       \
FCFLAGS="-m64 -fdefault-integer-8"      \
CFLAGS=-m64                             \
CXXFLAGS=-m64
# make -j 8
# make install

Step 3: Update your PATH and LD_LIBRARY_PATH in your .bashrc

export PATH=/usr/local/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH

Verify that the installation is correct

# ompi_info -a | grep 'Fort integer size' 
Fort integer size: 8

Registering sufficient memory for OpenIB when using Mellanox HCA

If you encounter errors about "error registering openib memory", similar to what is written below, you may want to take a look at the OpenMPI FAQ entry "I'm getting errors about 'error registering openib memory'; what do I do?".

WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              node02
  Registerable memory:     32768 MiB
  Total memory:            65476 MiB

Your MPI job will continue, but may be behave poorly and/or hang.

The explanation and solution can be found at How to increase MTT Size in Mellanox HCA.

In summary, the error occurs with applications that consume a large amount of memory: the application may fail when not enough memory can be registered with RDMA, so the MTT size needs to be increased. However, increasing the MTT size has the downside of increasing the number of cache misses, which increases latency.

1. To check your value of log_num_mtt

# cat /sys/module/mlx4_core/parameters/log_num_mtt

2. To check your value of log_mtts_per_seg

# cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

There are two parameters that affect registered memory. This information is taken from http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

With Mellanox hardware, two parameters are provided to control the size of this table:

  • log_num_mtt (on some older Mellanox hardware, the parameter may be num_mtt, not log_num_mtt): number of memory translation tables
  • log_mtts_per_seg: number of MTT entries per segment

The amount of memory that can be registered is calculated using this formula:

In newer hardware:
    max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE

In older hardware: 
    max_reg_mem = num_mtt * (2^log_mtts_per_seg) * PAGE_SIZE

For example, if your server's physical memory is 64 GB RAM, you will need to register 2 times the memory (2 x 64 GB = 128 GB) for max_reg_mem. You will also need to get the PAGE_SIZE (see Virtual Memory PAGESIZE on CentOS).

max_reg_mem = (2^log_num_mtt) * (2^3) * (4 kB)
128 GB = (2^log_num_mtt) * (2^3) * (4 kB)
2^37 = (2^log_num_mtt) * (2^3) * (2^12)
2^22 = (2^log_num_mtt)
22 = log_num_mtt

The setting is placed in /etc/modprobe.d/mlx4_mtt.conf on every node.
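
Following the worked example above (log_num_mtt = 22 and log_mtts_per_seg = 3), a minimal sketch of what the file might contain is shown below; the mlx4_core module must be reloaded (or the node rebooted) for the change to take effect:

# cat /etc/modprobe.d/mlx4_mtt.conf
options mlx4_core log_num_mtt=22 log_mtts_per_seg=3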

References:

  1. OpenMPI FAQ – 15. I’m getting errors about “error registering openib memory”; what do I do?
  2. How to increase MTT Size in Mellanox HCA
  3. OpenMPI FAQ – 18. Open MPI is warning me about limited registered memory; what does this mean?
  4. Virtual Memory PAGESIZE on CentOS