Taken from IBM Spectrum LSF Wiki – Cleaning up Parallel Job Execution Problems.
- Clean up all left-over processes on all execution nodes
- Perform post-job cleanup operations on all execution nodes, such as cleaning up cgroups, cleaning up Kerberos credentials, resetting CPU frequencies, etc.
- Clean up the job from LSF and mark the job with Exit status
LSF's default behavior is designed to handle the most common recovery cases for the failure scenarios listed below. LSF also provides a set of parameters that let users tune its behavior for each scenario, in particular how quickly LSF detects each failure and what action it takes in response.
- First execution host crashing or hanging
- Non-first execution host crashing or hanging
- Parallel job tasks exit abnormally
In lsb.params:
REMOVE_HUNG_JOBS_FOR = runlimit:host_unavail
With this setting, LSF removes jobs if they run 10 minutes past the job run limit, or if they remain in the UNKNOWN state for 10 minutes because the first execution host became unavailable. To change the wait time, for example to 5 minutes for both conditions:
REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]
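For context, here is a minimal sketch of where this line would sit in lsb.params. It assumes the usual Begin Parameters / End Parameters section and reuses the 5-minute value from the line above purely as an illustration. Other parameters in the section are omitted.

```
# lsb.params (sketch; other parameters omitted)
Begin Parameters
# Remove jobs 5 minutes past their run limit, or after 5 minutes in
# UNKNOWN state because the first execution host is unavailable.
REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]
End Parameters
```

Changes to lsb.params generally take effect only after the batch system is reconfigured, typically with `badmin reconfig`.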
Other Information:
Notes on DJOB_HB_INTERVAL and DJOB_RU_INTERVAL (lsb.applications) and LSF_RES_ALIVE_TIMEOUT (lsf.conf):
- The default value of LSB_DJOB_HB_INTERVAL is 120 seconds per 1000 nodes
- The default value of LSB_DJOB_RU_INTERVAL is 300 seconds per 1000 nodes
For large, long-running parallel jobs, LSB_DJOB_RU_INTERVAL can be set to a long interval, or disabled entirely with a value of 0, to prevent overly frequent resource usage updates. These updates consume network bandwidth as well as CPU time while LSF processes a large volume of resource usage information. LSB_DJOB_HB_INTERVAL cannot be disabled.
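The tuning described above could look roughly like the following sketch. The application profile name `big_mpi` and the 600-second value are hypothetical and chosen only for illustration. The cluster-wide setting in lsf.conf and the per-application setting in lsb.applications are shown as two alternatives, not as a recommended combination.

```
# lsf.conf (cluster-wide): disable periodic resource usage updates
# for parallel job tasks. A value of 0 disables the update.
LSB_DJOB_RU_INTERVAL=0

# lsb.applications (per application profile): alternatively, lengthen
# the update interval for one class of jobs only.
# "big_mpi" and 600 seconds are illustrative assumptions.
Begin Application
NAME = big_mpi
DJOB_RU_INTERVAL = 600
End Application
```

A job would pick up the per-application setting only when submitted to that profile, for example with `bsub -app big_mpi`.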
References: