Building LAMMPS using CMake with OpenMPI on Rocky Linux 8

What is LAMMPS (briefly)?

LAMMPS is a classical molecular dynamics code with a focus on materials modeling. The name is an acronym for Large-scale Atomic/Molecular Massively Parallel Simulator. For more information on the software, do take a look at

Where to Download?

You can download the latest stable release from Download LAMMPS.

Step 1: Ensure Prerequisites are present

Step 2: Download and build LAMMPS

For more information, do take a look at

$ tar -zxvf lammps-stable.tar.gz
$ cd lammps-2Aug2023
$ mkdir build
$ touch
$ vim

Inside the

cmake   -C ../cmake/presets/most.cmake ../cmake             \
-D CMAKE_INSTALL_PREFIX=/usr/local/lammps-2Aug2023 \
-D FFTW3_LIBRARIES=${FFTW_LIB}/libfftw3_mpi.a

The -C ../cmake/presets/most.cmake option loads a preset file that enables most of the packages that don’t need extra libraries.

Compile and Install

$ make -j 16 
$ make install
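
Before running the cmake line above, it can help to confirm that the FFTW library it points at is really there. The sketch below is a hypothetical pre-flight check (the function name check_fftw_lib is my own and not part of LAMMPS or FFTW). It assumes FFTW_LIB holds the directory that contains libfftw3_mpi.a, as in the cmake line.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that the static FFTW MPI library referenced by
# -D FFTW3_LIBRARIES=${FFTW_LIB}/libfftw3_mpi.a exists before invoking cmake.
check_fftw_lib() {
    local lib_dir="$1"
    if [ -z "$lib_dir" ]; then
        echo "FFTW_LIB is not set" >&2
        return 1
    fi
    if [ ! -f "$lib_dir/libfftw3_mpi.a" ]; then
        echo "libfftw3_mpi.a not found in $lib_dir" >&2
        return 1
    fi
    echo "found $lib_dir/libfftw3_mpi.a"
}
```

For example, call check_fftw_lib "$FFTW_LIB" first and only continue to cmake if it succeeds.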


  1. Compiling LAMMPS and using it with Python
  2. Building LAMMPS using CMake

Compiling FFTW-3.3.10 with OpenMPI on Rocky Linux 8

For a detailed explanation, do take a look at FFTW Installation on UNIX. The notes below cover my installation.

We will be focusing on using Nvidia hpcx only for this blog. To set up Nvidia hpcx, do take a look at Installing and using Mellanox HPC-X Software Toolkit.

You may want to run module use on the modulefiles which come with the hpcx installation.

export HPCX_HOME=/usr/local/hpcx-v2.15-gcc-MLNX_OFED_LINUX-5-redhat8-cuda12-gdrcopy2-nccl2.17-x86_64
module use $HPCX_HOME/modulefiles

Next, I used the following parameters, which suit my HPC environment. The default installation is already double-precision. I needed threads, OpenMP, MPI and AVX512 support.

# ./configure --prefix=/usr/local/fftw-3.3.10 --enable-threads --enable-openmp --enable-mpi --enable-avx512
# make && make install
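
Since --enable-avx512 only makes sense on CPUs that support it, a quick check before configuring may save a rebuild. This is a minimal sketch, assuming a Linux /proc/cpuinfo layout and that avx512f is the relevant flag to look for. The function name has_cpu_flag is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether a CPU feature flag is listed in a
# cpuinfo-style file (defaults to /proc/cpuinfo on Linux).
has_cpu_flag() {
    local flag="$1"
    local cpuinfo="${2:-/proc/cpuinfo}"
    # Scan only the "flags" lines and compare each field with the wanted
    # flag as a whole word; exit status 0 means the flag was found.
    awk -v f="$flag" '
        /^flags/ { for (i = 1; i <= NF; i++) if ($i == f) found = 1 }
        END { exit !found }
    ' "$cpuinfo"
}
```

For example, has_cpu_flag avx512f && echo "AVX512 available" tells you whether it is worth keeping the flag.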


  1. FFTW Installation on UNIX

Installing ORCA-5.0.4 on Rocky Linux 8 with OpenMPI

ORCA is a general-purpose quantum chemistry package that is free of charge for academic users. The Project and Download Website can be found at ORCA Forum. The current version is 5.0.4.

The prerequisites that I used were OpenMPI-4.1.1 and the system GNU compiler, which is version 8.5.

Unless I have missed something, ORCA-5.0.4 has been split into 3 different packages, which you have to untar and combine together:

  • orca_5_0_4_linux_x86-64_openmpi411_part1
  • orca_5_0_4_linux_x86-64_openmpi411_part2
  • orca_5_0_4_linux_x86-64_openmpi411_part3

How do I untar the packages?

The first thing is to untar all the packages separately. Assume you are untarring in /usr/local/.

$ tar -xf orca_5_0_4_linux_x86-64_openmpi411_part1.tar.xz
$ tar -xf orca_5_0_4_linux_x86-64_openmpi411_part2.tar.xz
$ tar -xf orca_5_0_4_linux_x86-64_openmpi411_part3.tar.xz

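What do I do with all the untarred packages?

Copy all the untarred files into /usr/local/orca-5.0.4. The commands below assume that this directory already exists and is your current working directory.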

cp -rv ../orca_5_0_4_linux_x86-64_openmpi411_part1/* .
cp -rv ../orca_5_0_4_linux_x86-64_openmpi411_part2/* .
cp -rv ../orca_5_0_4_linux_x86-64_openmpi411_part3/* .

How to Compile OpenMPI-4.1.1?

Although Compiling OpenMPI-4.1.5 for ROCEv2 with GNU-8.5 covers a newer version of OpenMPI, the same principles and parameters can still be used.

How do I Put them Together?

If you are not using the Module Environment, you can consider installing it. For more information, do take a look at Installing Environment Modules on Rocky Linux 8.5. All you then need to do is load the additional modules, such as OpenMPI, as prerequisites. Alternatively, you can set the PATH and LD_LIBRARY_PATH for OpenMPI, something like this.

export OPENMPI_HOME=/usr/local/openmpi-4.1.1
export PATH=$OPENMPI_HOME/bin:$PATH:/usr/local/orca-5.0.4
export LD_LIBRARY_PATH=$OPENMPI_HOME/lib:$LD_LIBRARY_PATH

If you are not using the Module Environment, you may want to



  1. Installing ORCA

Enabling Nvidia Tesla 4 x A100 with NVLink for MPI

I was having issues getting applications like NetKet to detect and enable MPI.


  1. I have installed OpenMPI and enabled CUDA during the configuration.
  2. The CUDA libraries, including nvidia-smi, have been installed without issue. But when running nvidia-smi topo --matrix, I am not able to see NVLink similar to

In fact, when I run NetKet on CUDA with MPI, the error that was generated was

mpirun noticed that process rank 0 with PID 0 on node gpu1 exited on signal 11 (Segmentation fault).


This forum entry provided some enlightenment.

The solution was to disable Multi-Instance GPU (MIG) mode, which was enabled on my server, using the command below. Reboot the server afterwards, and nvidia-smi topo --matrix should then show the NVLink connections.

nvidia-smi -mig 0
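
To confirm that no GPU still reports MIG as enabled after the reboot, the per-GPU MIG state can be filtered with a small helper. This is only a sketch: the function name mig_enabled_gpus is hypothetical, and it assumes input lines in the "index, state" form that nvidia-smi prints for --query-gpu=index,mig.mode.current --format=csv,noheader.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "index, mig.mode.current" CSV lines on stdin and
# print the index of every GPU whose MIG mode is still reported as Enabled.
mig_enabled_gpus() {
    awk -F', *' '$2 == "Enabled" { print $1 }'
}
```

A typical call would be nvidia-smi --query-gpu=index,mig.mode.current --format=csv,noheader | mig_enabled_gpus, which prints nothing when MIG is off everywhere.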

Enabling Persistence Mode

Make sure the configuration stays after a reboot.

# systemctl enable nvidia-persistenced.service
# systemctl start nvidia-persistenced.service

Retrieving OpenMPI Configuration

If you need to find out how your OpenMPI was configured, you may want to use the commands below.

$ ./ompi_info -all|grep 'command line'
 Configure command line: '--prefix=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi' '--with-libevent=internal' '--enable-mpi1-compatibility' '--without-xpmem' '--with-cuda=/hpc/local/oss/cuda12.1.1' '--with-slurm' '--with-platform=contrib/platform/mellanox/optimized' '--with-hcoll=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/hcoll' '--with-ucx=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucx' '--with-ucc=/build-result/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ucc'
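
The configure line is long and hard to read in one piece. As a rough sketch, the helper below splits it into one argument per line. The name ompi_configure_flags is my own, and it assumes every argument is wrapped in single quotes as in the output above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read ompi_info output on stdin and print each
# configure argument from the "Configure command line" entry on its own line.
ompi_configure_flags() {
    grep 'Configure command line' | grep -o "'[^']*'" | tr -d "'"
}
```

For example, ompi_info --all | ompi_configure_flags | grep cuda shows at a glance whether CUDA support was requested at configure time.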

If you wish to look at the full configuration

$ ./ompi_info
Package: Open MPI root@hpc-kernel-03 Distribution
Open MPI: 4.1.5rc2
Open MPI repo revision: v4.1.5rc1-17-gdb10576f40
Open MPI release date: Unreleased developer copy
Open RTE: 4.1.5rc2
Open RTE repo revision: v4.1.5rc1-17-gdb10576f40
Open RTE release date: Unreleased developer copy
OPAL: 4.1.5rc2
OPAL repo revision: v4.1.5rc1-17-gdb10576f40
OPAL release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 4.1.5rc2
Prefix: /usr/local/hpcx-v2.16-gcc-mlnx_ofed-redhat8-cuda12-gdrcopy2-nccl2.18-x86_64/ompi

Compiling OpenMPI-4.1.5 for ROCEv2 with GNU-8.5

Prerequisites 1

First things first: you may want to check whether you are using RoCE. Do take a look at Installing RoCE using Mellanox (Nvidia) OFED package.

Prerequisites 2

Do check whether you have UCX installed. If not, you can install it with dnf.

# dnf install ucx ucx-devel

Alternatively, you can do a manual install. For information on how to install, do take a look at

# wget
$ tar xzf ucx-1.4.0.tar.gz
$ cd ucx-1.4.0
$ ./contrib/configure-release --prefix=/usr/local/ucx-1.4.0
$ make -j8 
$ make install

Prerequisites 3

Make sure you have installed the GNU C and C++ compilers. This can be done easily using dnf.

# dnf install gcc-c++ gcc

Step 1: Download the OpenMPI package

You can go to the OpenMPI website to download the latest package. The latest release series at the point of writing is OpenMPI-4.1, and this post builds OpenMPI-4.1.5.

Step 2: Compile the Package

$ ./configure --prefix=/usr/local/openmpi-4.1.5 --enable-mpi-cxx --with-devel-headers --with-ucx --with-verbs --with-slurm=no
$ make && make install

Step 3: To run mpirun using RoCE, do the following.

You may want to see Network Support Information on OpenMPI

$ mpirun --np 12 --hostfile path/to/hostfile --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ........
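
The value given to UCX_NET_DEVICES has the form device:port. If you are unsure which device to use, one option is to derive it from the ibdev2netdev tool that ships with MLNX_OFED. The sketch below assumes that tool is installed and that its lines look like "mlx5_0 port 1 ==> ens1f0 (Up)". The helper name up_ucx_devices and the interface name in the comment are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read ibdev2netdev-style lines on stdin, e.g.
#   mlx5_0 port 1 ==> ens1f0 (Up)
# and print device:port for every port that is reported as Up.
up_ucx_devices() {
    awk '$2 == "port" && $NF == "(Up)" { print $1 ":" $3 }'
}
```

Running ibdev2netdev | up_ucx_devices then lists the candidates you can pass with -x UCX_NET_DEVICES=.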


  1. Setting up a RoCE cluster
  2. OpenMPI – Network Support
  3. How do I run Open MPI over RoCE? (UCX PML)

Encountering shm_open permission denied issues with hpcx

If you are using Nvidia hpc-x and encounter an error like the one below during your MPI run:

shm_open(file_name=/ucx_shm_posix_77de2cf3 flags=0xc2) failed: Permission denied

The error message indicates that the process has no permission to use shared memory. The permission of /dev/shm was found to be 755, not 777, which caused the error. The issue is resolved once the permission is changed to 777. To change the permission and verify it:

% chmod 777 /dev/shm
% ls -ld /dev/shm
drwxrwxrwx 2 root root 40 Jul  6 15:18 /dev/shm
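
If you want to check many nodes for the same problem, the test can be scripted. This is a minimal sketch, assuming GNU stat (as on Rocky Linux). The function name is_world_writable is my own.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if "other" users have write permission on a
# directory, judging by the octal mode printed by GNU stat.
is_world_writable() {
    local mode other
    mode=$(stat -c '%a' "$1") || return 2
    other=${mode: -1}    # last octal digit holds the permissions for "other"
    (( other & 2 ))      # bit value 2 is the write bit
}
```

For example, is_world_writable /dev/shm || echo "/dev/shm is not world-writable" flags a node that would hit the shm_open error.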

Installing CP2K with Nvidia HPCX on Rocky Linux 8.5

What is HPCX?

NVIDIA® HPC-X® is a comprehensive software package that includes Message Passing Interface (MPI), Symmetrical Hierarchical Memory (SHMEM) and Partitioned Global Address Space (PGAS) communications libraries, and various acceleration packages. For more information, do take a look at

What is CP2K?

CP2K is a quantum chemistry and solid state physics software package that can perform atomistic simulations of solid state, liquid, molecular, periodic, material, crystal, and biological systems. CP2K provides a general framework for different modeling methods such as DFT using the mixed Gaussian and plane waves approaches GPW and GAPW. Supported theory levels include DFTB, LDA, GGA, MP2, RPA, semi-empirical methods (AM1, PM3, PM6, RM1, MNDO, …), and classical force fields (AMBER, CHARMM, …). CP2K can do simulations of molecular dynamics, metadynamics, Monte Carlo, Ehrenfest dynamics, vibrational analysis, core level spectroscopy, energy minimisation, and transition state optimization using NEB or dimer method. For more information, do take a look at

Getting the CP2K

git clone --recursive cp2k

Unpack hpcx and the Optimised OpenMPI Libraries. For more information on installation, do take a look at Installing and Loading HPC-X.

Extract hpcx.tbz into your current working directory.

% tar -xvf hpcx.tbz
% cd hpcx
% export HPCX_HOME=$PWD
% module use $HPCX_HOME/modulefiles
% module load hpcx

The easiest way to compile is to use the CP2K Toolchain

% cd cp2k
% cd /usr/local/software/cp2k/tools/toolchain
% ./ --no-check-certificate --with-openmpi --with-sirius=no

Compiling the CP2K

==================== generating arch files ====================
arch files can be found in the /usr/local/software/cp2k/tools/toolchain/install/arch subdirectory
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local.ssmp
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local_static.ssmp
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local.sdbg
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local_coverage.sdbg
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local.psmp
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local.pdbg
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local_static.psmp
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local_warn.psmp
Wrote /usr/local/software/cp2k/tools/toolchain/install/arch/local_coverage.pdbg
========================== usage =========================
Now copy:
  cp /usr/local/software/cp2k/tools/toolchain/install/arch/* to the cp2k/arch/ directory
To use the installed tools and libraries and cp2k version
compiled with it you will first need to execute at the prompt:
  source /usr/local/software/cp2k/tools/toolchain/install/setup
To build CP2K you should change directory:
  cd cp2k/
  make -j 80 ARCH=local VERSION="ssmp sdbg psmp pdbg"

Follow the instructions at the end of the output exactly.

% cp /usr/local/software/cp2k/tools/toolchain/install/arch/* /usr/local/software/cp2k/arch
% source /usr/local/software/cp2k/tools/toolchain/install/setup
% cd /usr/local/software/cp2k
% make -j 32 ARCH=local VERSION="ssmp sdbg psmp pdbg"

If you encounter an error like the one below during the make, just install liblsan.

/usr/bin/ld: cannot find /usr/lib64/

% dnf install liblsan -y

If you encounter errors like the ones below for the fftw libraries,

/usr/bin/ld: cannot find -lfftw3_mpi
collect2: error: ld returned 1 exit status

You have to go to the library directory of the FFTW package built by the toolchain and create symbolic links.

% cd /usr/local/software/cp2k/tools/toolchain/install/fftw-3.3.10/lib
% ln -s libfftw3.a libfftw3_mpi.a
% ln -s

Try again

% cd /usr/local/software/cp2k
% make -j 32 ARCH=local VERSION="ssmp sdbg psmp pdbg"

If successful, you should see the binaries in /usr/local/software/cp2k/exe/local.
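
A quick way to confirm that every requested VERSION was actually built is to look for one executable per version. The sketch below assumes the binaries are named cp2k.<version> (for example cp2k.psmp). The function name missing_cp2k_binaries is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the name of every expected cp2k.<version> binary
# that is missing or not executable in the given directory.
missing_cp2k_binaries() {
    local exe_dir="$1"
    shift
    local version
    for version in "$@"; do
        [ -x "$exe_dir/cp2k.$version" ] || echo "cp2k.$version"
    done
}
```

For example, missing_cp2k_binaries /usr/local/software/cp2k/exe/local ssmp sdbg psmp pdbg prints nothing when all four builds succeeded.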

Efficient Heterogeneous Parallel Programming Using OpenMP

This article is taken from Intel “Efficient Heterogeneous Parallel Programming Using OpenMP”. In this article, we will show you how to do CPU+GPU asynchronous calculations using OpenMP.

In some cases, offloading computations to an accelerator like a GPU means that the host CPU sits idle until the offloaded computations are finished. However, using the CPU and GPU resources simultaneously can improve the performance of an application. In OpenMP® programs that take advantage of heterogeneous parallelism, the master clause can be used to exploit simultaneous CPU and GPU execution.

The Intel® oneAPI DPC++/C++ Compiler was used with the following command-line options:
-O3 -Ofast -xCORE-AVX512 -mprefer-vector-width=512 -ffast-math -qopt-multiple-gather-scatter-by-shuffles -fimf-precision=low
-fiopenmp -fopenmp-targets=spir64="-fp-model=precise"

OpenMP provides true asynchronous, heterogeneous execution on CPU+GPU systems. According to the timing results and VTune profiles in the Intel article, keeping both the CPU and GPU busy in the OpenMP parallel region gives the best performance. We encourage you to try this approach.

Intel: Efficient Heterogeneous Parallel Programming Using OpenMP (Best Practices to Keep the CPU and GPU Working at the Same Time)

Compiling ORCA-4.2.1 with OpenMPI-3.1.4

ORCA is a general-purpose quantum chemistry package that is free of charge for academic users. The Project and Download Website can be found at ORCA Forum.

You have to register yourself before you can participate in the forum or download ORCA-4.2.1. The latest version of ORCA at the time of writing is 5.0.3. The package you might want to consider is ORCA 4.2.1, Linux, x86-64, .tar.xz Archive.

Prerequisites that I use.

Unpacking ORCA-4.2.1

% tar -xvf orca_4_2_1_linux_x86-64_openmpi314.tar.xz

Running ORCA. If your environment uses the Module Environment

% module load openmpi/3.1.4/gcc-6.5.0

If not, you have to set PATH, LD_LIBRARY_PATH and MANPATH yourself.
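
Without modules, the three variables can be set by hand. The sketch below is one way to do it, assuming OpenMPI follows the usual bin, lib and share/man layout under its installation prefix. The function name use_prefix and the prefix in the usage note are hypothetical, so substitute your own installation path.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prepend an installation prefix to PATH,
# LD_LIBRARY_PATH and MANPATH, assuming the bin, lib and share/man layout.
use_prefix() {
    local prefix="$1"
    # The ${VAR:+:$VAR} form avoids a stray ":" when a variable is unset.
    export PATH="$prefix/bin${PATH:+:$PATH}"
    export LD_LIBRARY_PATH="$prefix/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export MANPATH="$prefix/share/man${MANPATH:+:$MANPATH}"
}
```

For example, use_prefix /usr/local/openmpi-3.1.4 would make that OpenMPI build the first one found by the shell and the dynamic linker.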


Typical Input file

ORCA must be called with its full path

/usr/local/orca_4_2_1_linux_x86-64_openmpi314/orca $INPUT "--bind-to core --verbose" > $OUTPUT

For input file usage, you may want to take a look at the ORCA 4.2.1 Manual, which you will find when you unpack the archive, or you can look at it online at orca_manual_4_2_1.pdf.

For example:

! B3LYP def2-SVP SP
%tddft
  tda false
  nroots 50
  triplets true
end
%pal
  nprocs 32
end

* xyz 0 1
  Ir        0.00000        0.00000        0.03016
   N       -1.05797        1.55546       -1.09121
   N        1.87606        0.13850       -1.09121