Cannot initialize RDMA protocol on Cluster with Platform LSF


If you encounter this issue during an application run and your scheduler is Platform LSF, there is a simple solution.

Symptoms:

explicit_dp: Rank 0:13: MPI_Init_thread: didn't find active interface/port
explicit_dp: Rank 0:13: MPI_Init_thread: Can't initialize RDMA device
explicit_dp: Rank 0:13: MPI_Init_thread: Internal Error: Cannot initialize RDMA protocol
MPI Application rank 13 exited before MPI_Init() with status 1
mpirun: Broken pipe

Cause:
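In this case the locked memory limit was set to unlimited in /etc/security/limits.conf, but this was not sufficient. The settings in limits.conf are applied by pam_limits to login sessions, not to daemons started from an init script at boot.
The MPI jobs were started under LSF, and the LSF daemons had been started with a very small locked memory limit, which every job they launched inherited.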
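
One way to confirm this cause is to look at the limits of the running LSF daemons rather than at your own login shell. The following is a minimal sketch, not a tested procedure from the original case: the function names are hypothetical, the daemon names (lim, res, sbatchd) are assumed to match your installation, and reading another user's /proc/<pid>/limits may require root.

```shell
#!/bin/sh
# Print the soft "Max locked memory" value from a /proc/<pid>/limits style file.
# The fields on that line are: Max locked memory <soft> <hard> <units>.
soft_memlock_limit() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Report the soft limit of every running LSF daemon.
# Daemon names are an assumption; adjust them for your site.
report_lsf_memlock() {
    for name in lim res sbatchd; do
        for pid in $(pgrep -x "$name" 2>/dev/null); do
            echo "$name (pid $pid): $(soft_memlock_limit "/proc/$pid/limits")"
        done
    done
}

report_lsf_memlock
```

If the reported value is a small number of bytes rather than "unlimited", the daemons (and therefore the jobs they start) are running with the restricted limit described above.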

Solution:
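Set the locked memory limit to unlimited in /etc/init.d/lsf by adding the ‘ulimit -l unlimited’ command before the line that sources profile.lsf. A limit set this way only applies to processes started afterwards, so restart the LSF daemons for the change to take effect.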

.....
.....
### END INIT INFO
ulimit -l unlimited
. /opt/lsf/conf/profile.lsf
.....
.....
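
After the daemons have been restarted, the limit can be checked from inside a job, since that is the environment the MPI ranks actually see. The following is a small sketch under assumptions: the function name is hypothetical, and the script is meant to be submitted through LSF rather than run from a login shell.

```shell
#!/bin/sh
# Succeed only if the given `ulimit -l` value means "no limit".
memlock_is_unlimited() {
    [ "$1" = "unlimited" ]
}

# Query the limit of the current shell, which a job inherits from the LSF daemons.
current=$(ulimit -l)
if memlock_is_unlimited "$current"; then
    echo "locked memory: unlimited"
else
    echo "locked memory is limited to ${current}; RDMA initialization may fail"
fi
```

Saved under a name of your choice (for example a hypothetical check_memlock.sh), it could be run interactively on a compute node with something like `bsub -I ./check_memlock.sh`. Running it directly in a login shell only shows the limits.conf value, which was already correct in this case.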

References:

  1. HP HPC Linux Value Pack 3.1 – Platform MPI job failed
