Cleaning up Platform LSF parallel Job Execution Problems – Part 3

This part covers abnormal task exits in parallel jobs.

This article is taken from Cleaning up Platform LSF parallel job execution problems

 If some tasks exit abnormally during parallel job execution, LSF takes action to terminate and clean up the entire job. This behaviour can be customized with RTASK_GONE_ACTION in an application profile in lsb.applications or with the LSB_DJOB_RTASK_GONE_ACTION environment variable in the job environment.
The LSB_DJOB_RTASK_GONE_ACTION environment variable overrides the setting of RTASK_GONE_ACTION in lsb.applications.
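As a rough sketch of the per-job override, the variable can be exported in the shell that submits the job. The value used here is just one combination of the supported keywords listed below, and the job command is borrowed from the example later in this article; treat both as placeholders for your own settings.

```shell
# Example per-job override; the keyword combination is illustrative only.
export LSB_DJOB_RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"

# Submit only if bsub is available on this host; otherwise just show the setting.
if command -v bsub >/dev/null 2>&1; then
    bsub -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi
else
    echo "LSB_DJOB_RTASK_GONE_ACTION=$LSB_DJOB_RTASK_GONE_ACTION"
fi
```

Because the environment variable takes precedence, this should apply to the submitted job even if its application profile sets RTASK_GONE_ACTION differently.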
The following values are supported:
[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]
KILLJOB_TASKDONE: LSF terminates all tasks in the job when one remote task exits with a zero value.
KILLJOB_TASKEXIT: LSF terminates all tasks in the job when one remote task exits with a non-zero value.
IGNORE_TASKCRASH: LSF does nothing when a remote task crashes. The job continues to run to completion.
By default, RTASK_GONE_ACTION is not defined, so LSF terminates all tasks and shuts down the entire job when one task crashes.
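The rules above can be summarised as a small decision table. The function below is a toy model written for this article, not LSF code: the function name and the event labels (done, exit, crash) are made up for illustration, and it assumes that a normal zero or non-zero exit leaves the job running unless the matching KILLJOB keyword is set.

```shell
# Toy model of the documented rules; this is NOT how LSF is implemented.
# Usage: rtask_gone_decision <event> "<RTASK_GONE_ACTION value>"
#   event: done  (remote task exited with zero)
#          exit  (remote task exited with non-zero)
#          crash (remote task crashed)
rtask_gone_decision() {
    # Pad the setting with spaces so only whole keywords match.
    case "$1" in
        done)
            case " $2 " in
                *" KILLJOB_TASKDONE "*) echo "kill job" ;;
                *) echo "continue" ;;   # assumption: no action without the keyword
            esac ;;
        exit)
            case " $2 " in
                *" KILLJOB_TASKEXIT "*) echo "kill job" ;;
                *) echo "continue" ;;   # assumption: no action without the keyword
            esac ;;
        crash)
            case " $2 " in
                *" IGNORE_TASKCRASH "*) echo "continue" ;;
                *) echo "kill job" ;;   # documented default when a task crashes
            esac ;;
        *)
            echo "unknown event: $1" >&2
            return 1 ;;
    esac
}

rtask_gone_decision crash ""
rtask_gone_decision crash "IGNORE_TASKCRASH KILLJOB_TASKEXIT"
rtask_gone_decision exit  "IGNORE_TASKCRASH KILLJOB_TASKEXIT"
```

With the setting left empty, a crash ends the job; with both keywords set, a crash is ignored but a non-zero exit still ends the job.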
 For example:
  • Define an application profile in lsb.applications:
Begin Application
NAME         = myApp
DJOB_COMMFAIL_ACTION=IGNORE_COMMFAIL
RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"
DESCRIPTION  = Application profile example
End Application
  • Run badmin reconfig as LSF administrator to make the configuration take effect.
  • Submit an MPICH2 job with -app myApp:
$ bsub -app myApp -n 4 -R "span[ptile=2]" mpiexec.hydra ./cpi
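Before reconfiguring, it can help to confirm what a profile actually contains. The helper below is a hypothetical sketch, not an LSF tool: it reads an lsb.applications file with awk and prints the RTASK_GONE_ACTION value for one named profile. It assumes simple KEY = value lines between Begin Application and End Application, and the function name and file path are placeholders.

```shell
# Print the RTASK_GONE_ACTION value configured for one application profile.
# Usage: show_rtask_action <path-to-lsb.applications> <profile-name>
# Prints an empty line if the profile exists but does not set the parameter.
show_rtask_action() {
    awk -v app="$2" '
        /^[ \t]*Begin[ \t]+Application/ { in_block = 1; name = ""; action = ""; next }
        /^[ \t]*End[ \t]+Application/ {
            if (in_block && name == app) print action
            in_block = 0
            next
        }
        in_block && /^[ \t]*#/ { next }        # skip comment lines
        in_block {
            eq = index($0, "=")
            if (eq == 0) next
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim whitespace around the value
            gsub(/^"|"$/, "", val)             # drop surrounding straight quotes
            if (key == "NAME") name = val
            if (key == "RTASK_GONE_ACTION") action = val
        }
    ' "$1"
}

# Example call (replace the path with the location of your own file):
# show_rtask_action /path/to/lsb.applications myApp
```

If the printed value still shows quote characters, the file probably contains typographic quotes pasted from a web page, which are worth replacing with plain double quotes.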

References:

  1. Cleaning up parallel job execution problems
  2. Cleaning up Platform LSF parallel Job Execution Problems – Part 1
  3. Cleaning up Platform LSF parallel Job Execution Problems – Part 2
  4. Cleaning up Platform LSF parallel Job Execution Problems – Part 3

 

Compiling Intel BLAS95 and LAPACK95 Interface Wrapper Library

The BLAS95 and LAPACK95 wrappers to Intel MKL are delivered both as part of Intel MKL and as source code, which can be compiled to build a standalone wrapper library with exactly the same functionality.

The source code and makefiles for the wrappers are found in the …..\interfaces\blas95 subdirectory of the Intel MKL directory.

For BLAS95

# cd $MKLROOT
# cd interfaces/blas95
# make libintel64  INSTALL_DIR=$MKLROOT/lib/intel64

Once compiled, the libraries are kept in $MKLROOT/lib/intel64.

For LAPACK95

# cd $MKLROOT
# cd interfaces/lapack95
# make libintel64  INSTALL_DIR=$MKLROOT/lib/intel64

Once compiled, the libraries are kept in $MKLROOT/lib/intel64.
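After the builds above, a quick way to confirm where the wrapper archives ended up is to search the install directory. This is a minimal sketch: the function name is made up, and the file name patterns (libmkl_blas95*.a and libmkl_lapack95*.a) are an assumption about how the archives are named, which may differ between MKL versions.

```shell
# List BLAS95/LAPACK95 wrapper archives found anywhere under a directory.
# Usage: find_mkl95_libs <dir>
# The name patterns are assumed; adjust them if your MKL version differs.
find_mkl95_libs() {
    find "$1" -type f \( -name 'libmkl_blas95*.a' -o -name 'libmkl_lapack95*.a' \) | sort
}

# Example call, searching the directory passed as INSTALL_DIR:
# find_mkl95_libs "$MKLROOT/lib/intel64"
```

Searching recursively is deliberate, since some makefile versions may place the archives in a subdirectory of INSTALL_DIR rather than directly in it.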