Cleaning Up Platform LSF Parallel Job Execution Problems – Part 1

Taken from IBM Spectrum LSF Wiki – Cleaning up Parallel Job Execution Problems.

Job cleanup refers to the following:

Clean up all left-over processes on all execution nodes
Perform post-job cleanup operations on all execution nodes, such as cleaning up cgroups, cleaning up Kerberos credentials, resetting CPU frequencies, etc.
Clean up the job from LSF and mark the job with an Exit status

The default LSF behavior is designed to handle recovery in the most common failure scenarios. LSF also offers a set of parameters that let users tune its behavior for each scenario, in particular how quickly LSF detects each failure and what action it takes in response.

There are typically three scenarios requiring job cleanup:

First execution host crashing or hanging
Non-first execution host crashing or hanging
Parallel job tasks exit abnormally

This article describes how to configure LSF to handle these scenarios.

Scenario 1 – Parallel job first execution host crashing or hanging

When the first execution host crashes or hangs, by default, LSF will mark a running job as UNKNOWN. LSF does not clean up the job until the host comes back and the LSF master confirms that the job is really gone from the system. However, this default behaviour may not always be desirable, since such hung jobs will hold their resource allocation for some time. Define REMOVE_HUNG_JOBS_FOR in lsb.params to change the default LSF behaviour and remove the hung jobs from the system automatically.

In lsb.params

REMOVE_HUNG_JOBS_FOR = runlimit:host_unavail

With this setting, LSF removes jobs if they run 10 minutes past the job run limit or remain UNKNOWN for 10 minutes because the first execution host has become unavailable. To change the timing, set wait_time (in minutes) for each condition:

In lsb.params

REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]
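
To make the two settings above concrete, here is a minimal Python sketch of how the value breaks down into conditions and wait times, and how the removal rule described in this section could be modelled. It is only an illustration, not LSF code: the function names parse_remove_hung_jobs_for and should_remove are hypothetical, the parser accepts the value exactly as written in this article (with or without the square brackets), and the 10-minute default is taken from the description above.

```python
import re

# Default wait, in minutes, as described in the article text (assumption of this sketch).
DEFAULT_WAIT_MINUTES = 10


def parse_remove_hung_jobs_for(value):
    """Split a REMOVE_HUNG_JOBS_FOR value into {condition: wait_minutes}.

    Hypothetical helper: accepts "runlimit", "runlimit,wait_time=5"
    and "runlimit[,wait_time=5]" style entries separated by ":".
    """
    conditions = {}
    for part in value.split(":"):
        match = re.fullmatch(
            r"\s*(\w+)\s*\[?(?:,\s*wait_time\s*=\s*(\d+))?\]?\s*", part
        )
        if match is None:
            raise ValueError(f"unrecognised condition: {part!r}")
        name, wait = match.groups()
        conditions[name] = int(wait) if wait is not None else DEFAULT_WAIT_MINUTES
    return conditions


def should_remove(conditions, minutes_past_run_limit=None, minutes_unknown=None):
    """Model of the removal rule described in the article (not LSF's own logic).

    minutes_past_run_limit: minutes the job has run beyond its run limit,
        or None if it has not exceeded the limit.
    minutes_unknown: minutes the job has been UNKNOWN because its first
        execution host is unavailable, or None if it is not UNKNOWN.
    """
    if "runlimit" in conditions and minutes_past_run_limit is not None:
        if minutes_past_run_limit >= conditions["runlimit"]:
            return True
    if "host_unavail" in conditions and minutes_unknown is not None:
        if minutes_unknown >= conditions["host_unavail"]:
            return True
    return False


if __name__ == "__main__":
    default = parse_remove_hung_jobs_for("runlimit:host_unavail")
    tuned = parse_remove_hung_jobs_for(
        "runlimit[,wait_time=5]:host_unavail[,wait_time=5]"
    )
    # A job that has been UNKNOWN for 6 minutes under each setting.
    print(should_remove(default, minutes_unknown=6))
    print(should_remove(tuned, minutes_unknown=6))
```

Under this model, a job that has been UNKNOWN for 6 minutes is kept with the default setting but removed once wait_time is lowered to 5, which is the trade-off the parameter exposes: shorter waits free resources sooner, at the risk of removing a job whose host would have recovered.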

Other Information:

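Related parameters are DJOB_HB_INTERVAL and DJOB_RU_INTERVAL (lsb.applications) and LSF_RES_ALIVE_TIMEOUT (lsf.conf). The LSB_-prefixed names below are the corresponding cluster-wide settings in lsf.conf.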

The default value of LSB_DJOB_HB_INTERVAL is 120 seconds per 1000 nodes
The default value of LSB_DJOB_RU_INTERVAL is 300 seconds per 1000 nodes

For large, long-running parallel jobs, LSB_DJOB_RU_INTERVAL can be set to a long interval, or disabled with a value of 0, to prevent overly frequent resource usage updates, which consume network bandwidth as well as the CPU time LSF needs to process a large volume of resource usage information. LSB_DJOB_HB_INTERVAL cannot be disabled.
