Cleaning up Platform LSF parallel Job Execution Problems – Part 1

Taken from IBM Spectrum LSF Wiki – Cleaning up Parallel Job Execution Problems. 

Job cleanup refers to the following:
  1. Clean up all leftover processes on all execution nodes
  2. Perform post-job cleanup operations on all execution nodes, such as cleaning up cgroups, cleaning up Kerberos credentials, resetting CPU frequencies, etc.
  3. Clean up the job from LSF and mark the job's exit status

The default LSF behavior is designed to handle the most common recovery cases for these scenarios. LSF also offers a set of parameters for tuning this behavior for each scenario, in particular how quickly LSF detects each failure and what action it takes in response.

There are typically three scenarios requiring job cleanup:
  1. The first execution host crashing or hanging
  2. A non-first execution host crashing or hanging
  3. Parallel job tasks exiting abnormally
This article series describes how to configure LSF to handle these scenarios; this part covers the first.
Scenario 1 – Parallel job first execution host crashing or hanging
When the first execution host crashes or hangs, LSF by default marks a running job as UNKNOWN and does not clean it up until the host comes back and the LSF master confirms that the job is really gone from the system. This default behavior is not always desirable, since such hung jobs hold their resource allocation in the meantime. Define REMOVE_HUNG_JOBS_FOR in lsb.params to change the default behavior and remove hung jobs from the system automatically.

In lsb.params

REMOVE_HUNG_JOBS_FOR = runlimit:host_unavail

With this setting, LSF removes jobs that run 10 minutes past the job run limit or that remain UNKNOWN for 10 minutes because the first execution host became unavailable. To change the timing, specify the wait time in minutes with the wait_time option:

In lsb.params
REMOVE_HUNG_JOBS_FOR = runlimit[,wait_time=5]:host_unavail[,wait_time=5]
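After editing lsb.params, reconfigure the batch system so the change takes effect. The standard command for this (run as the LSF administrator) is:

$ badmin reconfig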


Other Information:

Note the following about DJOB_HB_INTERVAL and DJOB_RU_INTERVAL (lsb.applications) and LSF_RES_ALIVE_TIMEOUT (lsf.conf):

  1. The default value of LSB_DJOB_HB_INTERVAL is 120 seconds per 1000 nodes
  2. The default value of LSB_DJOB_RU_INTERVAL is 300 seconds per 1000 nodes

For large, long-running parallel jobs, LSB_DJOB_RU_INTERVAL can be set to a long interval, or disabled entirely with a value of 0, to avoid overly frequent resource usage updates; these updates consume network bandwidth as well as CPU time while LSF processes large volumes of resource usage information. LSB_DJOB_HB_INTERVAL cannot be disabled.
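As a sketch, these intervals can be tuned per application profile in lsb.applications; the profile name bigmpi and the values below are illustrative, not recommendations:

Begin Application
NAME = bigmpi
DESCRIPTION = profile for large, long-running parallel jobs
# Heartbeat every 5 minutes instead of the scaled default
DJOB_HB_INTERVAL = 300
# Disable periodic resource usage updates entirely
DJOB_RU_INTERVAL = 0
End Application

Jobs then opt in to the profile at submission time with the -app option:

$ bsub -app bigmpi -n 1024 ./a.out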

References: 

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3

Submitting Jobs with Topology Scheduling on Platform LSF

This blog is a follow-up to Topology Scheduling on Platform LSF.
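As a quick reminder from that article, topology scheduling relies on compute unit (CU) types defined in lsb.params, with hosts grouped into CUs in the ComputeUnit section of lsb.hosts. The type names below are illustrative, listed from finest to coarsest granularity:

COMPUTE_UNIT_TYPES = enclosure rack cabinet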

Scenario 1: Submit directly to a specific Compute Unit

$ bsub -m "r1" -n 64 ./a.out

This job asks for 64 slots, all of which must be on hosts in the CU r1.
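To check which compute units are configured, and hence which names (such as r1) are valid here, the CU configuration and member hosts can be listed with bmgroup (the -cu option shows compute units):

$ bmgroup -cu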

Scenario 2: Requesting slots at a Compute Unit type level (for example, rack)

$ bsub -R "cu[type=rack]" -n 64 ./a.out

This job asks for 64 slots, with compute unit scheduling applied at the rack level.

Scenario 3: Sequential Job Packing
The following job uses pref=minavail to prefer the compute units with the fewest free slots, so that sequential jobs are packed into as few CUs as possible.

$ bsub -R "cu[pref=minavail]" ./a.out
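As an illustration (the loop itself is plain shell, not LSF syntax), submitting a batch of such sequential jobs packs them into the fullest CUs first, leaving emptier CUs free for large parallel jobs:

$ for i in $(seq 1 10); do bsub -R "cu[pref=minavail]" ./a.out; done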

Scenario 4: Parallel Job Packing
The following job uses pref=maxavail to prefer the compute units with the most free slots, keeping the tasks of a parallel job close together.

$ bsub -R "cu[pref=maxavail]" -n 64 ./a.out

Scenario 5: Limiting the number of CUs a job spans
The following allows a job to span at most 2 CUs of type rack, preferring racks with the most free slots.

$ bsub -R "cu[type=rack:pref=maxavail:maxcus=2]" -n 32 ./a.out
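The cu[] section can also be combined with other resource requirement sections such as span[]. For example, the following sketch (the ptile value is illustrative) keeps all 32 slots within a single rack while placing 16 tasks per host:

$ bsub -R "cu[type=rack:maxcus=1] span[ptile=16]" -n 32 ./a.out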

References:

  1. Using Compute Units for Topology Scheduling
  2. Topology Scheduling on Platform LSF