Basic Tuning of RDMA Parameters for Spectrum Scale

If your cluster shows symptoms of overload and GPFS keeps reporting “overloaded” in its logs, as in the excerpt below, you may see long waiters and sometimes deadlocks.

Wed Apr 11 15:53:44.232 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 15:55:24.488 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 15:57:04.743 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 15:58:44.998 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 16:00:25.253 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 16:28:45.601 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 16:33:56.817 2018: [N] sdrServ: Received deadlock notification from
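
A quick way to confirm that long waiters are actually accumulating is to dump the current waiters on a suspect NSD server or client. This is a general health check rather than something specific to the log above:

# mmdiag --waiters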

Increase scatterBufferSize to a Value that Matches the IB Fabric
One of the first parameters to tune is scatterBufferSize. According to the Best Practices RDMA Tuning wiki (reference 1 below), FDR10 fabrics can be tuned to 131072 and FDR14 fabrics to 262144.

The default value of 32768 may perform OK. If the CPU utilization on the NSD IO servers is observed to be high and client IO performance is lower than expected, increasing the value of scatterBufferSize on the clients may improve performance.

# mmchconfig scatterBufferSize=131072
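
It can also help to confirm the actual InfiniBand link rate before choosing a value, and to check afterwards that the daemon has picked up the new setting. The commands below are a generic sketch; ibstat assumes the InfiniBand diagnostic tools are installed on the node:

# ibstat | grep -i rate
# mmfsadm dump config | grep -i scatterBufferSize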

There are other parameters that can be tuned, but scatterBufferSize was the one that worked immediately for me. Their current values can be checked as shown below:
verbsRdmaSend
verbsRdmasPerConnection
verbsRdmasPerNode
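
To record the current values of the parameters above before changing anything (assuming GPFS is running on the node you query):

# mmfsadm dump config | egrep -i 'verbsRdmaSend|verbsRdmasPer'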

Disable verbsRdmaSend on the NSD Nodes

# mmchconfig verbsRdmaSend=no -N nsd1,nsd2

Verify the settings have taken effect

# mmfsadm dump config | grep -i verbsRdma

Increase verbsRdmasPerNode to 514 for NSD Nodes

# mmchconfig verbsRdmasPerNode=514 -N nsd1,nsd2
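
Depending on the Spectrum Scale release, changes to the verbs* parameters may only take effect after the GPFS daemon has been restarted on the affected nodes. A possible sequence, assuming nsd1 can be taken down while nsd2 continues serving, is:

# mmshutdown -N nsd1
# mmstartup -N nsd1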

References:

  1. Best Practices RDMA Tuning

Cannot initialize RDMA protocol on a Cluster with Platform LSF

If you encounter this issue during an application run and your scheduler is Platform LSF, there is a simple solution.

Symptoms

explicit_dp: Rank 0:13: MPI_Init_thread: didn't find active interface/port
explicit_dp: Rank 0:13: MPI_Init_thread: Can't initialize RDMA device
explicit_dp: Rank 0:13: MPI_Init_thread: Internal Error: Cannot initialize RDMA protocol
MPI Application rank 13 exited before MPI_Init() with status 1
mpirun: Broken pipe

Cause:
In this case the amount of locked memory was set to unlimited in /etc/security/limits.conf, but this was not sufficient.
The MPI jobs were started under LSF, and because jobs inherit their resource limits from the LSF daemons rather than from limits.conf, the very small locked-memory limit the daemons were started with was passed on to the MPI ranks.
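
One way to confirm that jobs really do inherit the daemons' small limit is to print the locked-memory limit from inside an interactive LSF job; this is a generic check, not part of the original diagnosis:

# bsub -I /bin/sh -c 'ulimit -l'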

Solution:
Set the amount of locked memory to unlimited in /etc/init.d/lsf by adding the ‘ulimit -l unlimited’ command before profile.lsf is sourced, then restart the LSF daemons so they pick up the new limit.

.....
.....
### END INIT INFO
ulimit -l unlimited
. /opt/lsf/conf/profile.lsf
.....
.....
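
After restarting the LSF daemons you can confirm that they now run with an unlimited locked-memory limit by inspecting /proc for one of them. The daemon name lim below is an assumption; res or sbatchd can be checked the same way:

# grep 'Max locked memory' /proc/$(pgrep -x -o lim)/limits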

References:

  1. HP HPC Linux Value Pack 3.1 – Platform MPI job failed