May 11, 2017 – The Linux Cluster

This post covers parallel jobs in which a non-first execution host crashes or hangs.

Taken from IBM Spectrum LSF Wiki – Cleaning up Parallel Job Execution Problems

Scenario 1: LSB_FANOUT_TIMEOUT_PER_LAYER (lsf.conf)

Before a parallel job executes, LSF must do some setup work on each job execution host and populate job information to all of these hosts. LSF provides a communication fan-out framework to handle this. If an execution host fails, the framework uses a timeout value to control how quickly LSF treats the communication as failed and rolls back the job dispatch decision. By default, the timeout is 20 seconds for each communication layer. Define LSB_FANOUT_TIMEOUT_PER_LAYER in lsf.conf to customize the timeout value, then restart sbatchd on all hosts:

# badmin hrestart all
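Editing lsf.conf by hand works, but a small helper makes the change repeatable. The following is a minimal sketch, not an LSF tool: set_lsf_param is a hypothetical function name, and the example path assumes that LSF_ENVDIR points at the directory holding lsf.conf. It expects plain KEY=VALUE lines and simple values such as numbers.

```shell
# set_lsf_param FILE KEY VALUE   (hypothetical helper, not an LSF command)
# Replaces an existing KEY=... line in FILE, or appends one if KEY is absent.
set_lsf_param() {
    file=$1 key=$2 value=$3
    tmp="${file}.tmp.$$"
    if grep -q "^${key}=" "$file"; then
        # Rewrite the existing definition; copy back with cat so the
        # original file keeps its owner and permissions.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp" &&
            cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example (assumed path; run as the LSF administrator):
#   set_lsf_param "$LSF_ENVDIR/lsf.conf" LSB_FANOUT_TIMEOUT_PER_LAYER 60
#   badmin hrestart all
```

Running the helper twice with different values leaves a single definition in the file, so it is safe to call from a configuration script.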

Important Notes

LSB_FANOUT_TIMEOUT_PER_LAYER can also be defined in the environment before job submission to override the value specified in lsf.conf.
Set a larger value for large jobs (for example, 60 for jobs that span more than 1,000 nodes).
One sign that this parameter needs to be increased is that bhist -l shows jobs bouncing back and forth between starting and pending because of job timeout errors. Timeout errors are logged in the sbatchd log.

$ bhist -l 100
Job , User , Project , Command 
Mon Oct 21 19:20:43: Submitted from host , to Queue , CW
                     D , 320 Processors Requested, Reque
                     sted Resources <span[ptile=8]>;
Mon Oct 21 19:20:43: Dispatched to 40 Hosts/Processors   <
……
Mon Oct 21 19:20:43: Starting (Pid 19137);
Mon Oct 21 19:21:06: Pending: Failed to send fan-out information to other SBDs;
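To check a job for this symptom without reading the whole history, the output can be filtered for the fan-out message. The sketch below is an illustration only: count_fanout_failures is a hypothetical helper, and it assumes that bhist -l wraps long records onto indented continuation lines (as in the excerpt above), which it joins back together before counting.

```shell
# count_fanout_failures: reads `bhist -l` output on stdin and prints how
# many times the job went back to pending because fan-out failed.
# (Hypothetical helper; assumes continuation lines start with whitespace.)
count_fanout_failures() {
    awk '
        # Continuation line: strip the indent and glue it to the record.
        /^[ \t]+/ { sub(/^[ \t]+/, ""); buf = buf $0; next }
        # New record: flush the previous one.
        { if (NR > 1) print buf; buf = $0 }
        END { if (NR > 0) print buf }
    ' | grep -c 'Failed to send fan-out information' || true
}

# Example:
#   bhist -l 100 | count_fanout_failures
# If the count keeps growing, raise the timeout for the next submission:
#   export LSB_FANOUT_TIMEOUT_PER_LAYER=60
```

A count of 0 means the job never hit the fan-out timeout; repeated failures suggest the timeout is too short for the job size.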

Scenario 2: LSF_DJOB_TASK_REG_WAIT_TIME (lsf.conf)

When a parallel job is started, an LSF component on the first execution host needs to receive a registration message from other components on non-first execution hosts. By default, LSF waits for 300 seconds for those registration messages. After 300 seconds, LSF starts to clean up the job.

Use LSF_DJOB_TASK_REG_WAIT_TIME to customize the time period. The parameter can be defined in lsf.conf or in the job environment at job submission. The value in lsf.conf applies to all jobs in the cluster, while the job environment variable only controls the behaviour of that particular job and overrides the value in lsf.conf. The unit is seconds. Set a larger value for large jobs (for example, 600 seconds for jobs across 5000 nodes). After changing lsf.conf, restart RES for the new value to take effect:

# lsadmin resrestart
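For the per-job override, the variable only needs to be present in the environment of the submission command. A minimal sketch follows, assuming a POSIX shell; run_with_reg_wait is a hypothetical wrapper name, and the bsub arguments in the example are placeholders.

```shell
# run_with_reg_wait SECONDS COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND with LSF_DJOB_TASK_REG_WAIT_TIME set for that command only,
# so the value does not leak into later submissions from the same shell.
run_with_reg_wait() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_with_reg_wait SECONDS COMMAND [ARGS...]" >&2
        return 2
    fi
    seconds=$1
    shift
    case $seconds in
        ''|*[!0-9]*)
            echo "wait time must be a whole number of seconds" >&2
            return 2 ;;
    esac
    LSF_DJOB_TASK_REG_WAIT_TIME=$seconds "$@"
}

# Example (placeholder job):
#   run_with_reg_wait 600 bsub -n 320 ./my_mpi_job
```

Scoping the variable to a single command keeps large jobs on the longer wait while other jobs submitted from the same shell keep the cluster-wide value.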

You should set this parameter if you see an INFO level message like the following in res.log.first_execution_host:

$ grep "waiting for all tasks to register" res.log.hostA
Oct 20 20:20:29 2013 7866 6 9.1.1 doHouseKeeping4ParallelJobs: job 101 timed out (20) waiting for all tasks to register, registered (315) out of (320)
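The message shows how many tasks registered before the timeout, which tells you how far short the job fell. As a rough sketch, the fields can be pulled out with sed; parse_reg_timeouts is a hypothetical helper, and the pattern assumes the message layout shown in the log line above.

```shell
# parse_reg_timeouts: reads res.log lines on stdin and, for each
# registration timeout, prints the job ID and the task counts.
# (Hypothetical helper; pattern matches the message format shown above.)
parse_reg_timeouts() {
    sed -n 's/.*job \([0-9][0-9]*\) timed out.*waiting for all tasks to register, registered (\([0-9][0-9]*\)) out of (\([0-9][0-9]*\)).*/\1 \2 \3/p' |
    while read -r job registered expected; do
        echo "job=$job registered=$registered expected=$expected missing=$((expected - registered))"
    done
}

# Example:
#   parse_reg_timeouts < res.log.hostA
# For the log line above this prints:
#   job=101 registered=315 expected=320 missing=5
```

A small number of missing tasks on an otherwise healthy large job points to slow registration rather than a failed host, which is the case where a larger LSF_DJOB_TASK_REG_WAIT_TIME helps.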

Scenario 3: DJOB_COMMFAIL_ACTION (lsb.applications)

After a job is successfully launched and all tasks register themselves, LSF keeps monitoring the connection from the first node to the rest of the execution nodes. If a connection failure is detected, by default, LSF begins to shut down the job. Configure DJOB_COMMFAIL_ACTION in an application profile in lsb.applications to customize the behaviour. The parameter syntax is:

DJOB_COMMFAIL_ACTION="KILL_TASKS|IGNORE_COMMFAIL"

IGNORE_COMMFAIL: LSF allows the job to continue to run. Communication failures between the first node and the rest of the execution nodes are ignored and the job continues.

KILL_TASKS: LSF tries to kill all the current tasks of a parallel or distributed job associated with the communication failure.

By default, DJOB_COMMFAIL_ACTION is not defined – LSF terminates all tasks and shuts down the entire job.

You can also set the environment variable LSB_DJOB_COMMFAIL_ACTION before submitting a job to override the value set in the application profile.
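To make the application profile concrete, the sketch below prints a profile stanza after checking that the action is one of the two documented values. It is only an illustration: commfail_profile and the profile name mpi_tolerant are hypothetical, and the stanza assumes the usual Begin Application / End Application layout of lsb.applications.

```shell
# commfail_profile NAME ACTION   (hypothetical helper)
# Prints an application profile stanza for lsb.applications, rejecting
# any action other than the two values described above.
commfail_profile() {
    name=$1 action=$2
    case $action in
        KILL_TASKS|IGNORE_COMMFAIL) ;;
        *)
            echo "unsupported action: $action" >&2
            return 2 ;;
    esac
    cat <<EOF
Begin Application
NAME = $name
DJOB_COMMFAIL_ACTION="$action"
End Application
EOF
}

# Example (hypothetical profile name):
#   commfail_profile mpi_tolerant IGNORE_COMMFAIL
# Per-job alternative, using the environment variable instead:
#   export LSB_DJOB_COMMFAIL_ACTION=IGNORE_COMMFAIL
```

Review the printed stanza before adding it to lsb.applications; jobs would then select the profile at submission, for example with bsub -app.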
