Efficient Heterogeneous Parallel Programming Using OpenMP

This article is taken from Intel “Efficient Heterogeneous Parallel Programming Using OpenMP”. In this article, we will show you how to do CPU+GPU asynchronous calculations using OpenMP.

In some cases, offloading computations to an accelerator like a GPU means that the host CPU sits idle until the offloaded computations are finished. However, using the CPU and GPU resources simultaneously can improve the performance of an application. In OpenMP® programs that take advantage of heterogenous parallelism, the master clause can be used to exploit simultaneous CPU and GPU execution. In this article, we will show you how to do CPU+GPU asynchronous calculation using OpenMP.

The Intel® oneAPI DPC++/C++ Compiler was used with following command-line options:
‑O3 ‑Ofast ‑xCORE‑AVX512 ‑mprefer‑vector‑width=512 ‑ffast‑math ‑qopt‑multiple‑gather‑scatter‑by‑shuffles ‑fimf‑precision=low
‑fiopenmp ‑fopenmp‑targets=spir64=”‑fp‑model=precise”

OpenMP provides true asynchronous, heterogeneous execution on CPU+GPU systems. It’s clear from our timing results and VTune profiles that keeping the CPU and GPU busy in the OpenMP parallel region gives the best performance. We encourage you to try this approach.

Intel: Efficient Heterogeneous Parallel Programming Using OpenMP (Best Practices to Keep the CPU and GPU Working at the Same Time)

Compiling ORCA-4.2.1 with OpenMPI-3.1.4

ORCA is a general-purpose quantum chemistry package that is free of charge for academic users. The Project and Download Website can be found at ORCA Forum

You have to register yourself before you can participate in the forum or download ORCA-4.2.1. The current latest version for ORCA is 5.0.3. The package you might want to consider is ORCA 4.2.1, Linux, x86-64, .tar.xz Archive

Prerequisites that I use.

Unpacking ORCA-4.2.1

% tar -xvf orca_4_2_1_linux_x86-64_openmpi314.tar.xz

Running ORCA. If your environment has Module Environment

% module load openmpi/3.1.4/gcc-6.5.0

If not, you have to pacify PATH and LD_LIBRARY_PATH, MANPATH


Typical Input file

Calling ORCA requires full pathing

/usr/local/orca_4_2_1_linux_x86-64_openmpi314/orca $INPUT > $OUTPUT "--bind-to core --verbose"

For Input File usage, you may want to take a look at the ORCA 4.2.1 Manual found when you unpack or you can look at it online at orca_manual_4_2_1.pdf (enea.it) .

For example…….

! B3LYP def2-SVP SP
tda false
nroots 50
triplets true
nprocs 32

* xyz 0 1 fac_irppy3.xyz
  Ir        0.00000        0.00000        0.03016
   N       -1.05797        1.55546       -1.09121
   N        1.87606        0.13850       -1.09121

Compiling LAMMPS-15Jun20 with GNU 6 and OpenMPI 3



Download the latest tar.gz from https://lammps.sandia.gov/

Step 1: Untar LAMMPS

% tar -zxvf lammps-stable.tar.gz

Step 2: Go to $LAMMPS_HOME/src. Make Standard Packages

% cd src
% make yes-standard
% make no-gpu
% make no-mscg

Step 3: Compile message libraries

% cd lammps-15Jun20/lib/message/cslib/src
% make lib_parallel zmq=no

Copy and rename the produced cslib/src/libcsmpi.a or libscnompi.a file to cslib/src/libmessage.a

% cp cslib/src/libcsmpi.a cslib/src/libmessage.a

Copy either lammps-15Jun20/lib/message/Makefile.lammps.zmq or Makefile.lammps.nozmq to lib/message/Makefile.lammps

% cp Makefile.lammps.nozmq Makefile.lammps

Step 4: Compile poems

% cd lammps-15Jun20/lib/poems
% make -f Makefile.g++

Step 5: Compile latte
Download LATTE code and unpack the tarball either in this /lib/latte directory

% git clone https://github.com/lanl/LATTE

Inside lammps-15Jun20/lib/latte/LATTE
Modify the makefile.CHOICES according to your system architecture and compilers

% cd lammps-15Jun20/lib/latte/LATTE
% cp makefile.CHOICES makefile.CHOICES.gfort
% make
% cd lammps-15Jun20/lib/latte
% ln -s ./LATTE/src includelink
% ln -s ./LATTE liblink
% ln -s ./LATTE/src/latte_c_bind.o filelink.o
% cp Makefile.lammps.gfortran Makefile.lammps

Step 6. Compile Voronoi

Download voro++-0.4.6.tar.gz from http://math.lbl.gov/voro++/download/
Untar the voro++-0.4.6.tar.gz inside lammps-15Jun20/lib/voronoi/

% tar -zxvf voro++-0.4.6.tar.gz
% cd lammps-15Jun20/lib/voronoi/voro++-0.4.6
% make

Step 7: Compile kim

Download kim from https://openkim.org/doc/usage/obtaining-models . The current version is  kim-api-2.1.3.txz

Download at /lammps-15Jun20/lib/kim

% cd lammps-15Jun20/lib/kim
% tar Jxvf kim-api-2.1.3.txz
% cd kim-api-2.1.3
% mkdir build
% cd build
% cmake .. -DCMAKE_INSTALL_PREFIX=${PWD}/../../installed-kim-api-2.1.3
% make -j2
% make install
% cd /lammps-15Jun20/lib/kim/installed-kim-api-2.1.3/
% source ${PWD}/kim-api-X2.1.3/bin/kim-api-activate
% kim-api-collections-management install system EAM_ErcolessiAdams_1994_Al__MO_324507536345_002

Step 8: Compile USER-COLVARS

% cd lammps-15Jun/lib/colvars
% make -f Makefile.g++

Step 9: Check Packages Status

% make package-status
[root@hpc-gekko1 src]# make package-status
Installed YES: package ASPHERE
Installed YES: package BODY
Installed YES: package CLASS2
Installed YES: package COLLOID
Installed YES: package COMPRESS
Installed YES: package CORESHELL
Installed YES: package DIPOLE
Installed  NO: package GPU
Installed YES: package GRANULAR
Installed YES: package KIM

Step 9a: To Activate Standard Package

% make yes-standard

Step 9b: To activate USER-COLVARS, USER-OMP

% make yes-user-colvars
% make yes-user-omp

Step 9c: To deactivate GPGPU

% make no-gpu

Step 10: Finally Compile LAMMPS

% cd lammps-15Jun20/src
% make g++_openmpi -j 16

You should have binary called lmp_g++_openmpi
Do a softlink

ln -s lmp_g++_openmpi lammps

Compiling OpenMPI-3.1.6 with GCC-6.5

We assumed that you have installed GNU 6.5 and isl-0.15

Download the latest OpenMPI 3.1.6 package from OpenMPI site

% ./configure --prefix=/usr/local/gnu/openmpi-3.1.6 --enable-orterun-prefix-by-default --enable-mpi-cxx --enable-openib-rdmacm-ibaddr --enable-mca-no-build=btl-uct

–enable-orterun-prefix-by-default (Configure OMPI –enable-orterun-prefix-by-default and so that you do not need to add the prefix option)
–enable-openib-rdmacm-ibaddr (To enable routing over IB)
–enable-mpi-cxx (C++ bindings are no more built by default)
–enable-mca-no-build=btl-uct (ecent OpenMPI versions contain a BTL component called ‘uct’, which can cause data corruption when enabled, due to conflict on malloc hooks between OPAL and UCM.)

% make all install | tee install.log


  1. Intel Community – Caught Signal 11 (Segmentation Fault: Does not mapped to object at)
  2. Open MPI + Scalasca :Can not run mpirun command with option –prefix?