Cannot initialize RDMA protocol on Cluster with Platform LSF

If you encounter this issue during an application run and your scheduler is Platform LSF, there is a simple solution.

Symptoms

explicit_dp: Rank 0:13: MPI_Init_thread: didn't find active interface/port
explicit_dp: Rank 0:13: MPI_Init_thread: Can't initialize RDMA device
explicit_dp: Rank 0:13: MPI_Init_thread: Internal Error: Cannot initialize RDMA protocol
MPI Application rank 13 exited before MPI_Init() with status 1
mpirun: Broken pipe

Cause:
In this case the amount of locked memory was set to unlimited in /etc/security/limits.conf, but this was not sufficient.
The MPI jobs were started under LSF, and the LSF daemons had been started with very small locked memory limits, which the jobs then inherited.

Solution:
Set the amount of locked memory to unlimited in /etc/init.d/lsf by adding the 'ulimit -l unlimited' command before the LSF profile is sourced, then restart the LSF daemons so they pick up the new limit.

.....
.....
### END INIT INFO
ulimit -l unlimited
. /opt/lsf/conf/profile.lsf
.....
.....
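
Once the daemons have been restarted with the new limit, a quick way to confirm that jobs now inherit it is an interactive check; the job should report unlimited:

$ bsub -I "ulimit -l"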

References:

  1. HP HPC Linux Value Pack 3.1 – Platform MPI job failed

Resolving Unable to determine user account for execution

If you are facing the issue "Unable to determine user account for execution", it is likely because an LSF cluster must be restarted when user authentication is switched from NIS to LDAP.

After the user authentication method is switched from NIS to LDAP, all user jobs stay pending with the pending reason: "Unable to determine user account for execution".

If the user authentication mode is changed, you have to restart the LSF daemons on all LSF hosts for jobs to run successfully. On the LSF master host, run:

# lsadmin reconfig
# badmin mbdrestart
# badmin hrestart all
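
Once the daemons are back up, you can confirm that the previously stuck jobs are no longer held by listing pending jobs and their pending reasons:

# bjobs -u all -p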

References:

  1. LSF cluster must be restarted if user authentication is switched from NIS to LDAP

Packing serial jobs neatly in Platform LSF

Taken from Placing jobs based on available job slots of hosts

Platform LSF allows you to pack or spread jobs as required. Before going further, a few terms to define:

  1. Packing means always placing jobs on the hosts with the least available slots first. Packing jobs can make room for bigger parallel jobs.
  2. Spreading tries to spread jobs out and places jobs on the hosts with the most available slots first. Spreading jobs maximizes the performance of individual jobs.

I will deal with one situation where I want to pack all the serial jobs neatly in as few nodes as possible.

Here are some relevant terms from the LSF Wiki:

The slots keyword represents available slots on each host and it is a built-in numeric decreasing resource. When a job occupies some of the job slots on the host, the slots resource value is decreased accordingly. For example, if MXJ of an LSF host is defined as 8, the slots value will be 8 when the host is empty. When 6 LSF job slots have been occupied, slots becomes 2. The slots resource can only be used in select[] and order[] sections of a job resource requirement string. To apply a job packing or spreading policy, you can use the order[] section in the job resource requirement. For example, -R "order[-slots]" will order candidate hosts based on the least available slots, while -R "order[slots]" will order candidate hosts based on the hosts with the most available slots.

To use ! in an order[] clause, you must set SCHED_PER_JOB_SORT=Y in lsb.params. To make the parameter take effect, run badmin mbdrestart or badmin reconfig on the master host to reconfigure mbatchd.

The following is an example of using the slots resource:
Step 1: Configure RES_REQ in a Queue section of lsb.queues.

Begin Queue
QUEUE_NAME = myqueue
…
RES_REQ = order[-slots]
…

End Queue

Step 2: Make the configuration take effect

# badmin reconfig

Step 3: Check whether the configuration has taken effect.

# bqueues -l myqueue
QUEUE: myqueue
…
RES_REQ: order[-slots]

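The same ordering can also be requested per job in the resource requirement string; a minimal sketch (the queue name and executable are illustrative):

$ bsub -q myqueue -R "order[-slots]" ./serial_job
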
You can also apply this at the application profile level. For more information, please read the Platform LSF Wiki.

References:

  1. Placing jobs based on available job slots of hosts

Setting up Secondary Master Host on Platform LSF

Setting up a secondary master host on Platform LSF can be very easy.

Step 1: Update the LSF_MASTER_LIST parameter in lsf.conf so that the secondary master candidate is listed after the current master host

# cd $LSF_ENVDIR
# vim lsf.conf

At line 114

.....
LSF_MASTER_LIST="h00 h01"
.....

If you wish to switch the order of the master hosts for maintenance, the same edit can be used.

Step 2: Reconfigure the cluster and restart the LSF mbatchd and mbschd processes

# lsadmin reconfig
# badmin mbdrestart

Step 3: Update the master_hosts in lsb.hosts

# cd $LSF_ENVDIR/lsbatch/yourhpccluster/configdir
# vim lsb.hosts
Begin HostGroup
GROUP_NAME    GROUP_MEMBER      #GROUP_ADMIN # Key words
master_hosts      (h00 h01)
End HostGroup
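
To verify which host is currently acting as master (for example, after failing over for maintenance), run lsid; the exact banner varies with your LSF version:

$ lsid
...
My cluster name is yourhpccluster
My master name is h00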

References:

  1. Switch LSF master host to secondary master candidate

Submitting an interactive job on Platform LSF

Using a Pseudo-terminal to launch Interactive Job

Point 1: Submit a batch interactive job using a pseudo-terminal.

$ bsub -Ip vim output.log

Submits a batch interactive job to edit output.log.

Point 2:  Submit a batch interactive job and create a pseudo-terminal with shell mode support.

$ bsub -Is bash

Submits a batch interactive job that starts up bash as an interactive shell.

When you specify the -Is option, bsub submits a batch interactive job and creates a pseudo-terminal with shell mode support when the job starts.
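
If your interactive session needs more than a single slot, -Is can be combined with the usual resource options; a minimal sketch (the resource string is illustrative):

$ bsub -Is -n 4 -R "span[hosts=1]" bash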

References:

  1. Submit an interactive job by using a pseudo-terminal

Cleaning up Platform LSF parallel Job Execution Problems – Part 3

This refers to Parallel job abnormal task exit

This article is taken from Cleaning up Platform LSF parallel job execution problems

 If some tasks exit abnormally during parallel job execution, LSF takes action to terminate and clean up the entire job. This behaviour can be customized with RTASK_GONE_ACTION in an application profile in lsb.applications or with the LSB_DJOB_RTASK_GONE_ACTION environment variable in the job environment.
The LSB_DJOB_RTASK_GONE_ACTION environment variable overrides the setting of RTASK_GONE_ACTION in lsb.applications.
 The following values are supported:
[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]
KILLJOB_TASKDONE:               LSF terminates all tasks in the job when one remote task exits with a zero value.
KILLJOB_TASKEXIT:               LSF terminates all tasks in the job when one remote task exits with non-zero value.
IGNORE_TASKCRASH:              LSF does nothing when a remote task crashes. The job continues to run to completion.
By default, RTASK_GONE_ACTION is not defined, so LSF terminates all tasks, and shuts down the entire job when one task crashes.
 For example:
  • Define an application profile in lsb.applications:
Begin Application
NAME         = myApp
DJOB_COMMFAIL_ACTION=IGNORE_COMMFAIL
RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"
DESCRIPTION  = Application profile example
End Application
  • Run badmin reconfig as LSF administrator to make the configuration take effect.
  • Submit an MPICH2 job with -app myApp:
$ bsub -app myApp -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi
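
The same behaviour can also be requested per job through the LSB_DJOB_RTASK_GONE_ACTION environment variable described above; a minimal sketch, assuming the submission environment is propagated to the job (the LSF default):

$ export LSB_DJOB_RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"
$ bsub -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi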

References:

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3

 

Cleaning up Platform LSF parallel Job Execution Problems – Part 2

This refers to Parallel job non-first execution host crashing or hanging.

Taken from IBM Spectrum LSF Wiki – Cleaning up Parallel Job Execution Problems

Scenario 1: LSB_FANOUT_TIMEOUT_PER_LAYER (lsf.conf)

Before a parallel job executes, LSF needs to do some setup work on each job execution host and populate job information to all these hosts. LSF provides a communication fan-out framework to handle this. If execution hosts fail, the framework has a timeout value that controls how quickly LSF treats the communication as failed and rolls back the job dispatching decision. By default, the timeout value is 20 seconds for each communication layer. Define LSB_FANOUT_TIMEOUT_PER_LAYER in lsf.conf to customize the timeout value.
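
For example, to raise the timeout to 60 seconds (the value the notes below suggest for jobs spanning over 1K nodes), you could add the following line to lsf.conf; the exact value is illustrative:

LSB_FANOUT_TIMEOUT_PER_LAYER=60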

After updating lsf.conf, restart sbatchd on all hosts so the change takes effect:

# badmin hrestart all

Important Notes

  1. LSB_FANOUT_TIMEOUT_PER_LAYER can also be defined in the job submission environment to override the value specified in lsf.conf.
  2. You can set a larger value for large jobs (for example, 60 for jobs spanning over 1K nodes).
  3. One indicator that this parameter needs tuning is bhist -l showing jobs bouncing back and forth between starting and pending due to job timeout errors. The timeout errors are logged in the sbatchd log.
$ bhist -l 100
Job , User , Project , Command
Mon Oct 21 19:20:43: Submitted from host , to Queue , CWD , 320 Processors Requested, Requested Resources <span[ptile=8]>;
Mon Oct 21 19:20:43: Dispatched to 40 Hosts/Processors;
……
Mon Oct 21 19:20:43: Starting (Pid 19137);
Mon Oct 21 19:21:06: Pending: Failed to send fan-out information to other SBDs;

Scenario 2: LSF_DJOB_TASK_REG_WAIT_TIME (lsf.conf)

When a parallel job is started, an LSF component on the first execution host needs to receive a registration message from other components on non-first execution hosts. By default, LSF waits for 300 seconds for those registration messages. After 300 seconds, LSF starts to clean up the job.

Use LSF_DJOB_TASK_REG_WAIT_TIME to customize the time period. The parameter can be defined in lsf.conf or in the job environment at job submission. The parameter in lsf.conf applies to all jobs in the cluster, while the job environment variable only controls the behaviour for that particular job; the job environment variable overrides the value in lsf.conf. The unit is seconds. Set a larger value for large jobs (for example, 600 seconds for jobs across 5000 nodes).
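
For example, to allow 600 seconds for task registration (the value suggested above for jobs across 5000 nodes), add the following line to lsf.conf; the exact value is illustrative:

LSF_DJOB_TASK_REG_WAIT_TIME=600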

After updating lsf.conf, restart RES so the change takes effect:

# lsadmin resrestart

You should set this parameter if you see an INFO level message like the following in res.log.first_execution_host:

$ grep "waiting for all tasks to register" res.log.hostA
Oct 20 20:20:29 2013 7866 6 9.1.1 doHouseKeeping4ParallelJobs: job 101 timed out (20) waiting for all tasks to register, registered (315) out of (320)

Scenario 3: DJOB_COMMFAIL_ACTION (lsb.applications)

After a job is successfully launched and all tasks register themselves, LSF keeps monitoring the connection from the first node to the rest of the execution nodes. If a connection failure is detected, by default, LSF begins to shut down the job. Configure DJOB_COMMFAIL_ACTION in an application profile in lsb.applications to customize the behaviour. The parameter syntax is:

DJOB_COMMFAIL_ACTION="KILL_TASKS|IGNORE_COMMFAIL"

IGNORE_COMMFAIL: LSF allows the job to continue to run. Communication failures between the first node and the rest of the execution nodes are ignored and the job continues.

KILL_TASKS: LSF tries to kill all the current tasks of a parallel or distributed job associated with the communication failure.

By default, DJOB_COMMFAIL_ACTION is not defined, so LSF terminates all tasks and shuts down the entire job.

You can also set the LSB_DJOB_COMMFAIL_ACTION environment variable before submitting the job to override the value set in the application profile.
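
A minimal application profile sketch for lsb.applications (the profile name myApp is illustrative; see Part 3 for a fuller example), followed by badmin reconfig to make it take effect:

Begin Application
NAME                  = myApp
DJOB_COMMFAIL_ACTION  = "IGNORE_COMMFAIL"
End Application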

References:

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3

Cleaning up Platform LSF parallel Job Execution Problems – Part 1

Taken from IBM Spectrum LSF Wiki – Cleaning up Parallel Job Execution Problems. 

 Job cleanup refers to the following:
  1. Clean up all left-over processes on all execution nodes
  2. Perform post-job cleanup operations on all execution nodes, such as cleaning up cgroups, cleaning up Kerberos credentials, resetting CPU frequencies, etc.
  3. Clean up the job from LSF and mark job Exit status

The LSF default behavior is designed to handle most common recovery for these scenarios. LSF also offers a set of parameters to allow end users to tune LSF behavior for each scenario, especially how fast LSF can detect each failure and what action LSF should take in response.

 There are typically three scenarios requiring job cleanup:
  1. First execution host crashing or hanging
  2. Non-first execution host crashing or hanging
  3. Parallel job tasks exit abnormally
This article describes how to configure LSF to handle these scenarios.

Scenario 1: Parallel job first execution host crashing or hanging

When the first execution host crashes or hangs, by default, LSF will mark a running job as UNKNOWN. LSF does not clean up the job until the host comes back and the LSF master confirms that the job is really gone from the system. However, this default behaviour may not always be desirable, since such hung jobs hold their resource allocation for some time. Define REMOVE_HUNG_JOBS_FOR in lsb.params to change the default LSF behaviour and remove hung jobs from the system automatically.

In lsb.params

REMOVE_HUNG_JOBS_FOR = runlimit:host_unavail

With this setting, LSF removes jobs if they run 10 minutes past the job run limit, or if they stay UNKNOWN for 10 minutes because the first execution host became unavailable. If you want to change the timing, specify a wait_time (in minutes) in lsb.params:

REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]

 

Other Information:

For DJOB_HB_INTERVAL and DJOB_RU_INTERVAL (lsb.applications) and LSF_RES_ALIVE_TIMEOUT (lsf.conf):

  1. The default value of LSB_DJOB_HB_INTERVAL is 120 seconds per 1000 nodes
  2. The default value of LSB_DJOB_RU_INTERVAL is 300 seconds per 1000 nodes

For large, long-running parallel jobs, LSB_DJOB_RU_INTERVAL can be set to a long interval or even disabled with a value of 0 to prevent overly frequent resource usage updates, which consume network bandwidth as well as CPU time for LSF to process large volumes of resource usage information. LSB_DJOB_HB_INTERVAL cannot be disabled.
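
As a sketch, a profile for very large, long-running jobs might lengthen the heartbeat interval and disable resource usage updates entirely in lsb.applications (the profile name and interval value are illustrative):

Begin Application
NAME              = bigParallel
DJOB_HB_INTERVAL  = 240
DJOB_RU_INTERVAL  = 0
End Application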

References: 

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3

Submitting Jobs with Topology Scheduling on Platform LSF

This blog is a follow-up to Topology Scheduling on Platform LSF.

Scenario 1: Submit directly to a specific Compute Unit

$ bsub -m "r1" -n 64 ./a.out

This job asks for 64 slots, all of which must be on hosts in the CU r1.

Scenario 2: Requesting a Compute Unit type level (for example, rack)

$ bsub -R "cu[type=rack]" -n 64 ./a.out

Scenario 3: Sequential Job Packing
The following job uses cu[pref=minavail] to prefer the compute units with the fewest free slots:

$ bsub -R "cu[pref=minavail]" ./a.out

Scenario 4: Parallel Job Packing
The following job uses cu[pref=maxavail] to prefer the compute units with the most free slots:

$ bsub -R "cu[pref=maxavail]" -n 64 ./a.out

Scenario 5: Limiting the number of CUs a job can span
The following allows a job to span at most 2 CUs of type "rack", preferring those with the most free slots:

$ bsub -R "cu[type=rack:pref=maxavail:maxcus=2]" -n 32 ./a.out

References:

  1. Using Compute Units for Topology Scheduling
  2. Topology Scheduling on Platform LSF