Unable to run hydra_bstrap_proxy when using mpiexec

If you are facing an issue similar to the error below, where the possible reasons provided are:

  1. Host is unavailable. Please check that all hosts are available.
  2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
  3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
  4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
[mpiexec@hpc-node1] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on hpc-npriv-g001 (pid 2778558, exit code 256)
[mpiexec@hpc-node1] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@hpc-node1] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@hpc-node1] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1065): error waiting for event
[mpiexec@hpc-node1] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@hpc-node1] Possible reasons:
[mpiexec@hpc-node1] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@hpc-node1] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@hpc-node1] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@hpc-node1] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.

The solution is to modify your mpiexec command to use the ssh bootstrap launcher:

$ mpiexec -bootstrap ssh ......

For example

$ mpiexec -bootstrap ssh python3 python.text

Alternatively, you can put the following line in your .bashrc or PBS script:

export I_MPI_HYDRA_BOOTSTRAP=ssh
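For example, a minimal PBS job script along these lines would work; the job name, resource requests, and the Python script name are only placeholders for illustration:

#!/bin/bash
#PBS -N mpi_job
#PBS -l select=2:ncpus=24:mpiprocs=24
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

# Tell Intel MPI's Hydra process manager to bootstrap with ssh
export I_MPI_HYDRA_BOOTSTRAP=ssh

mpiexec python3 your_script.py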

Retrieving OpenMPI Configuration

If you need to find out how your Open MPI installation was configured, you may want to use the commands below.

$ ./ompi_info -all|grep 'command line'
 Configure command line: '--prefix=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi' '--with-libevent=internal' '--enable-mpi1-compatibility' '--without-xpmem' '--with-cuda=/hpc/local/oss/cuda12.1.1' '--with-slurm' '--with-platform=contrib/platform/mellanox/optimized' '--with-hcoll=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/hcoll' '--with-ucx=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucx' '--with-ucc=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucc'

If you wish to look at the full configuration, run ompi_info without any arguments:

$ ./ompi_info
Package: Open MPI root@hpc-kernel-03 Distribution
Open MPI: 4.1.5rc2
Open MPI repo revision: v4.1.5rc1-17-gdb10576f40
Open MPI release date: Unreleased developer copy
Open RTE: 4.1.5rc2
Open RTE repo revision: v4.1.5rc1-17-gdb10576f40
Open RTE release date: Unreleased developer copy
OPAL: 4.1.5rc2
OPAL repo revision: v4.1.5rc1-17-gdb10576f40
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.1.5rc2
Prefix: /usr/local/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi
.....
.....
.....
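You can also grep the same output for specific components. For example, to check whether the UCX PML was built into this installation (the component list will vary with your build), you might run:

$ ./ompi_info | grep "MCA pml"

Each matching line names a PML component compiled into the library; if ucx is not listed, the installation was built without UCX support.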

Compiling OpenMPI-4.1.5 for RoCEv2 with GNU-8.5

https://docs.open-mpi.org/en/v5.0.x/release-notes/networks.html

Prerequisites 1

First things first, you may want to check whether you are using RoCE. Do take a look at Installing RoCE using Mellanox (Nvidia) OFED package

Prerequisites 2

Do check whether you have UCX installed. You can install it with dnf:

# dnf install ucx ucx-devel

Alternatively, you can do a manual install. For installation instructions, take a look at http://openucx.org/wp-content/uploads/UCX_install_guide.pdf

$ wget https://github.com/openucx/ucx/releases/download/v1.4.0/ucx-1.4.0.tar.gz
$ tar xzf ucx-1.4.0.tar.gz
$ cd ucx-1.4.0
$ ./contrib/configure-release --prefix=/usr/local/ucx-1.4.0
$ make -j8
$ make install
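To sanity-check the manual installation (assuming the prefix used above), ucx_info -v prints the UCX version and configure options, while ucx_info -d lists the devices and transports UCX can use:

$ /usr/local/ucx-1.4.0/bin/ucx_info -v
$ /usr/local/ucx-1.4.0/bin/ucx_info -d | grep -i -e device -e transport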

Prerequisites 3

Make sure you have installed the GNU C and C++ compilers. This can be done easily using dnf:

# dnf install gcc-c++ gcc

Step 1: Download the OpenMPI package

You can go to the OpenMPI site to download the latest package (https://www.open-mpi.org/software/ompi/v4.1/). The latest one at the time of writing is OpenMPI 4.1.5.

Step 2: Compile the Package

$ ./configure --prefix=/usr/local/openmpi-4.1.5 --enable-mpi-cxx --with-devel-headers --with-ucx --with-verbs --with-slurm=no
$ make && make install
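If you installed UCX into a custom prefix as in Prerequisites 2, you can point configure at it explicitly and then add the new OpenMPI to your environment. The paths below are a sketch that assumes the prefixes used earlier in this post:

$ ./configure --prefix=/usr/local/openmpi-4.1.5 --with-ucx=/usr/local/ucx-1.4.0 --enable-mpi-cxx --with-devel-headers --with-verbs --with-slurm=no
$ export PATH=/usr/local/openmpi-4.1.5/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/openmpi-4.1.5/lib:$LD_LIBRARY_PATH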

Step 3: To run mpirun over RoCE, do the following.

You may want to see the Network Support information for OpenMPI (reference 2 below).

$ mpirun --np 12 --hostfile path/to/hostfile --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ........
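For example, a complete command might look like the following; the host file path, process count, and application name (./your_mpi_app) are placeholders, and UCX_NET_DEVICES should name the RoCE port reported by ibv_devinfo on your nodes:

$ mpirun -np 12 --hostfile /path/to/hostfile --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./your_mpi_app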

References:

  1. Setting up a RoCE cluster
  2. OpenMPI – Network Support
  3. How do I run Open MPI over RoCE? (UCX PML)

The 9th Annual MVAPICH User Group (MUG) conference

The 9th Annual MVAPICH User Group (MUG) conference will be held virtually with free registration from August 23-25, 2021. An exciting program has been put together with the following highlights:

  • Two Keynote Talks by Luiz DeRose from Oracle and Gilad Shainer from NVIDIA
  • Seven Tutorials/Demos (AWS, NVIDIA, Oracle, Rockport Networks, X-ScaleSolutions, and The Ohio State University)
  • 16 Invited Talks from many organizations (LLNL, INL, Broadcom, Rockport Networks, Microsoft Azure, AWS, Paratools and University of Oregon, CWRU, SDSC, TACC, KISTI, Konkuk University, UTK, Redhat, NSF, X-ScaleSolutions, and OSC)
  • 12 Short Presentations from the MVAPICH2 project members
  • A talk on the Future Roadmap of the MVAPICH2 Project
  • A special session on the newly funded NSF AI Institute – ICICLE (https://icicle.osu.edu/)

The detailed program is available from http://mug.mvapich.cse.ohio-state.edu/program/

All interested parties are welcome to attend the event free of charge. Registration is available at the following link: http://mug.mvapich.cse.ohio-state.edu/registration/

Compiling OpenMPI-3.1.6 with GCC-6.5

We assume that you have installed GCC 6.5 and isl-0.15.

Download the OpenMPI 3.1.6 package from the OpenMPI site.

% ./configure --prefix=/usr/local/gnu/openmpi-3.1.6 --enable-orterun-prefix-by-default --enable-mpi-cxx --enable-openib-rdmacm-ibaddr --enable-mca-no-build=btl-uct

--enable-orterun-prefix-by-default (configures OMPI so that orterun/mpirun behaves as if --prefix were given, so you do not need to add the prefix option yourself)
--enable-openib-rdmacm-ibaddr (enables routing over IB)
--enable-mpi-cxx (C++ bindings are no longer built by default)
--enable-mca-no-build=btl-uct (recent OpenMPI versions contain a BTL component called ‘uct’, which can cause data corruption when enabled, due to a conflict on malloc hooks between OPAL and UCM)

% make all install | tee install.log
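As a quick check of the resulting build (assuming the prefix used in the configure line above), you can confirm that the C++ bindings and the openib BTL were compiled in:

% /usr/local/gnu/openmpi-3.1.6/bin/ompi_info | grep -i -e "C++ bindings" -e "MCA btl"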

References:

  1. Intel Community – Caught Signal 11 (Segmentation Fault: Does not mapped to object at)
  2. Open MPI + Scalasca: Cannot run mpirun command with option --prefix?

Release of MVAPICH2 2.3.4 GA and OSU Micro-Benchmarks (OMB) 5.6.3

The MVAPICH team is pleased to announce the release of MVAPICH2 2.3.4 GA and OSU Micro-Benchmarks (OMB) 5.6.3.

Features and enhancements for MVAPICH2 2.3.4 GA are as follows:

* Features and Enhancements (since 2.3.3):

  • Improved performance for small message collective operations
  • Improved performance for data transfers from/to non-contiguous buffers used by user-defined datatypes
  • Add custom API to identify if MVAPICH2 has in-built CUDA support
    • New API ‘MPIX_Query_cuda_support’ defined in mpi-ext.h
    • New macro ‘MPIX_CUDA_AWARE_SUPPORT’ defined in mpi-ext.h
  • Add support for MPI_REAL16 based reduction operations for Fortran programs
    • MPI_SUM, MPI_MAX, MPI_MIN, MPI_LAND, MPI_LOR, MPI_MINLOC, and MPI_MAXLOC
    • Thanks to Greg Lee@LLNL for the report and reproducer
    • Thanks to Hui Zhou@ANL for the initial patch
  • Add support to intercept aligned_alloc in ptmalloc
    • Thanks to Ye Luo @ANL for the report and the reproducer
  • Add support to enable fork safety in MVAPICH2 using environment variable
    • “MV2_SUPPORT_FORK_SAFETY”
  • Add support for user to modify QKEY using environment variable
    • “MV2_DEFAULT_QKEY”
  • Add multiple MPI_T PVARs and CVARs for point-to-point and collective operations
  • Enhanced point-to-point and collective tuning for AMD EPYC Rome, Frontera@TACC, Longhorn@TACC, Mayer@Sandia, Pitzer@OSC, Catalyst@EPCC, Summit@ORNL, Lassen@LLNL, and Sierra@LLNL systems
  • Give preference to CMA if LiMIC2 and CMA are enabled at the same time
  • Move -lmpi, -lmpicxx, and -lmpifort before other LDFLAGS in compiler wrappers like mpicc, mpicxx, mpif77, and mpif90
  • Allow passing flags to nvcc compiler through environment variable NVCCFLAGS
  • Display more meaningful error messages for InfiniBand asynchronous events
  • Add support for AMD Optimizing C/C++ (AOCC) compiler v2.1.0
  • Add support for GCC compiler v10.1.0
    • Requires setting FFLAGS=-fallow-argument-mismatch at configure time
  • Update to hwloc v2.2.0

 

* Bug Fixes (since 2.3.3):

  • Fix compilation issue with IBM XLC++ compilers and CUDA 10.2
  • Fix hangs with MPI_Get operations in UD-Hybrid mode
  • Initialize MPI3 data structures correctly to avoid random hangs caused by garbage values
  • Fix corner case with LiMIC2 and MPI3 one-sided operations
  • Add proper fallback and warning message when shared RMA window cannot be created
  • Fix race condition in calling mv2_get_path_rec_sl by introducing mutex
    • Thanks to Alexander Melnikov for reporting the issue and providing the patch
  • Fix mapping generation for the cases where hwloc returns zero on non-numa machines
    • Thanks to Honggang Li @Red Hat for the report and initial patch
  • Fix issues with InfiniBand registration cache and PGI20 compiler
  • Fix warnings raised by Coverity scans
    • Thanks to Honggang Li @Red Hat for the report
  • Fix bad baseptr address returned from MPI_Win_shared_query
    • Thanks to Adam Moody@LLNL for the report and discussion
  • Fix issues with HCA selection logic in heterogeneous multi-rail scenarios
  • Fix spelling mistake in error message
    • Thanks to Bill Long and Krishna Kandalla @Cray/HPE for the report
  • Fix compilation warnings and memory leaks

 

New features, enhancements, and bug fixes for OSU Micro-Benchmarks (OMB) 5.6.3 are listed here:

* New Features & Enhancements (since v5.6.2)

  • Add support for benchmarking applications that use ‘fork’ system call
    • osu_latency_mp

* Bug Fixes (since v5.6.2)

  • Fix compilation issue with IBM XLC++ compilers and CUDA 10.2
  • Allow passing flags to nvcc compiler
  • Fix issues in window creation with host-to-device and device-to-host transfers for one-sided tests

For downloading MVAPICH2 2.3.4 GA, OMB 5.6.3, and associated user guides, quick start guide, and accessing the SVN, please visit the following URL:

http://mvapich.cse.ohio-state.edu

Compiling OpenMPI-1.8.8 with Intel Compiler and CUDA

Configuration parameters for compiling OpenMPI-1.8.8 with the Intel Compiler and CUDA are shown below.

# ./configure --prefix=/usr/local/openmpi-1.8.8-gpu_intel-15.0.7 CC=icc CXX=icpc F77=ifort FC=ifort --with-devel-headers --enable-binaries --with-cuda=/usr/local/cuda/
# make -j 16
# make install
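To verify that the resulting build is CUDA-aware (assuming the prefix used in the configure line above), you can query ompi_info for the CUDA support flag; it should report value:true:

$ /usr/local/openmpi-1.8.8-gpu_intel-15.0.7/bin/ompi_info --parsable --all | grep mpi_built_with_cuda_support:value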

References:

  1. Compiling OpenMPI 1.6.5 with Intel 12.1.5 on CentOS 6
  2. Building OpenMPI Libraries for 64-bit integers

Debugging Tools to track run-time errors for mpirun

If you are having unexplained issues with mpirun, you can use various methods to troubleshoot.

Information on “--mca orte_base_help_aggregate 0”

If your mpirun dies without any error messages, you may want to read the following entry from the OpenMPI FAQ, “Debugging applications in parallel, 7. My process dies without any output. Why?”:

If your application fails due to memory corruption, Open MPI may subsequently fail to output an error message before dying. Specifically, starting with v1.3, Open MPI attempts to aggregate error messages from multiple processes in an attempt to show unique error messages only once (vs. one for each MPI process — which can be unwieldy, especially when running large MPI jobs).

However, this aggregation process requires allocating memory in the MPI process when it displays the error message. If the process’ memory is already corrupted, Open MPI’s attempt to allocate memory may fail and the process will simply die, possibly silently. When Open MPI does not attempt to aggregate error messages, most of its setup work is done during MPI_INIT and no memory is allocated during the “print the error” routine. It therefore almost always successfully outputs error messages in real time — but at the expense that you’ll potentially see the same error message for each MPI process that encountered the error.

Hence, the error message aggregation is usually a good thing, but sometimes it can mask a real error. You can disable Open MPI’s error message aggregation with the orte_base_help_aggregate MCA parameter. For example:

 $ mpirun --mca orte_base_help_aggregate 0 ...