The variables are as follows:
LSB_JOBID: LSF assigned job ID
LSB_BATCH_JID: Array job ID. Includes job ID and array index number
LSB_JOBINDEX: Job array index
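As a quick illustration, these variables can drive a per-element workload in an LSF job array script. A minimal sketch, where the job name, output file, and input-file naming are hypothetical:

#!/bin/bash
#BSUB -J myarray[1-10]      # hypothetical array job with elements 1 to 10
#BSUB -o output.%J.%I       # %J expands to the job ID, %I to the array index

# Each array element sees its own index via LSB_JOBINDEX
echo "Job ${LSB_JOBID}, element ${LSB_BATCH_JID}, index ${LSB_JOBINDEX}"
./process input.${LSB_JOBINDEX}.dat   # hypothetical per-index input file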
If you are planning to have nodes other than the Head Node of the cluster from which users can submit jobs, you may want to configure a Submission Node. By default, TORQUE allows only one submission node. There are two ways to configure a submission node; one way is to use the "submit_hosts" parameter on the Torque Server.
Step 1a: Configuring the Submission Node
First and foremost, one of the main prerequisites is that the submission node must be part of the resource pool known to the Torque Server. If it is not yet, follow the steps to make the to-be-submission node part of the resource pool as a pbs_mom client. You can check the setup by looking at Installing Torque 4.2.5 on CentOS 6, "Configuring the TORQUE Clients". You might want to follow up with the optional setup Adding and Specifying Compute Resources at Torque to make sure your core counts are correct.
Step 1b: Ensure the exchange keys between submission node and Torque Server
For more information, see Auto SSH Login without Password
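A minimal sketch of the key exchange, assuming you are root on the submission node and the Torque Server is reachable as torque-server.cluster.com (a placeholder hostname):

# ssh-keygen -t rsa                             # accept the defaults
# ssh-copy-id root@torque-server.cluster.com    # placeholder hostname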
Step 1c: Configure the submission node as a non-default queue (Optional)
For more information, see Using Torque to set up a Queue to direct users to a subset of resources
Step 2: Registering the Submission Node in Torque
If you do not wish the submission node to be used as a compute resource, you can assign it to a non-default or unique queue that users will not submit to.
Once you have configured the to-be-submission node as one of the clients, configure the Torque Server with these commands:
# qmgr -c 'set server submit_hosts = hostname1'
# qmgr -c 'set server allow_node_submit = True'
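If you later need more than one submission host, or want to verify the settings, something like the following should work (hostname2 is a placeholder):

# qmgr -c 'set server submit_hosts += hostname2'
# qmgr -c 'print server' | grep -E 'submit_hosts|allow_node_submit'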
Step 3: Putting Submission Node inside Torque Server /etc/hosts.equiv
# vim /etc/hosts.equiv
submission_node.cluster.com
Step 4a: Copy trqauthd from primary submission node to the secondary submission node
# scp -v /etc/init.d/trqauthd root@submission_node.cluster.com:/etc/init.d
Step 4b: Start the trqauthd service on the submission node
# service trqauthd start
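You will likely also want trqauthd to start on boot on the submission node; a quick sketch using the same init script:

# chkconfig --add trqauthd
# chkconfig trqauthd on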
Step 5: Test the configuration
Do an interactive test submission from the submission node:
$ qsub -I -l nodes=1:ppn=8
You should see from the Torque Server that the job has been submitted via the submission node by doing a qstat -an:
$ qstat -an
Step 6: Mount Maui Information from PBS/MAUI Server
From the MAUI Server, export the MAUI configuration and binaries over NFS.
Edit /etc/exports
/opt/maui Submission-Node1(rw,no_root_squash,async,no_subtree_check)
/usr/local/maui Submission-Node1(rw,no_root_squash,async,no_subtree_check)
At the MAUI Server, restart NFS Services
# service nfs restart
At the submission node, make sure you have the mount points /opt/maui and /usr/local/maui for the NFS mounts.
In /etc/fstab, add the mount entries, then restart netfs:
head-node1:/usr/local/maui /usr/local/maui nfs defaults 0 0
head-node1:/opt/maui /opt/maui nfs defaults 0 0
# service netfs restart
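To verify the mounts without rebooting, something like this should suffice:

# mount -a                          # pick up the new /etc/fstab entries
# df -h /opt/maui /usr/local/maui   # both should show the head-node1 NFS mounts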
When a job is stuck and cannot be removed by a normal qdel, you can use the command qdel -p jobid. Do note that this command should be used only when there is no other way to kill the job in the usual fashion, especially if the compute node is unresponsive.
# qdel -p jobID
If you are running clusters, you may want to set up a queue to direct users to a subset of resources with Torque. For example, you may wish to direct users who need specific resources like MATLAB to a particular queue.
More information can be found at Torque Documents 4.1 “4.1.4 Mapping a Queue to a subset of Resources”
…The simplest method is using default_resources.neednodes on an execution queue, setting it to a particular node attribute. Maui/Moab will use this information to ensure that jobs in that queue will be assigned nodes with that attribute…
For example, if you are creating a queue for users of MATLAB
qmgr -c "create queue matlab" qmgr -c "set queue matlab queue_type = Execution" qmgr -c "set queue matlab resources_default.neednodes = matlab" qmgr -c "set queue matlab enabled = True" qmgr -c "set queue matlab started = True"
For the nodes you are assigning to the queue, update the node properties. A good example can be found at 3.2 Node Properties.
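For a permanent assignment, the property can also be listed against the node in the server's nodes file (the path below assumes the default server home of /var/spool/torque):

# cat /var/spool/torque/server_priv/nodes
node001 np=16 matlab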
To add new properties on-the-fly,
qmgr -c "set node node001 properties += matlab"
(if you are adding additional properties to the nodes)
To remove properties on-the-fly
qmgr -c "set node node001 properties -= matlab"
1. Closing a Host
# badmin hclose hostid
Close hostid ...... done
2. Opening a Host
# badmin hopen hostid
Open hostid ...... done
3. Log a comment when closing or opening a host
# badmin hopen -C "Re-Provisioned" hostA
# badmin hclose -C "Weekly backup" hostB
The comment text Weekly backup is recorded in lsb.events. If you close or open a host group, each host group member displays with the same comment string.
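To confirm the host state afterwards, bhosts -l on the host in question should show closed_Adm together with the recorded admin action:

# bhosts -l hostB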
Host status
Host status describes the ability of a host to accept and run batch jobs in terms of daemon states, load levels, and administrative controls. The bhosts and lsload commands display host status.
1. bhosts
Displays the current status of the host
| STATUS | DESCRIPTION |
| --- | --- |
| ok | Host is available to accept and run new batch jobs. |
| unavail | Host is down, or LIM and sbatchd are unreachable. |
| unreach | LIM is running but sbatchd is unreachable. |
| closed | Host will not accept new jobs. Use bhosts -l to display the reasons. |
| unlicensed | Host does not have a valid license. |
2. bhosts -l
Displays the closed reasons. A closed host does not accept new batch jobs:
$ bhosts -l
HOST  node001
STATUS       CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
closed_Adm  60.00     -   16      0    0      0      0    0  -

CURRENT LOAD USED FOR SCHEDULING:
          r15s  r1m  r15m  ut  pg   io  ls  it     tmp   swp  mem  root   maxroot
Total      0.0  0.0   0.0  0%  0.0   0   0  28656  324G  16G  60G  3e+05  4e+05
Reserved   0.0  0.0   0.0  0%  0.0   0   0  0      0M    0M   0M   0.0    0.0

          processes  clockskew  netcard  iptotal  cpuhz   cachesize  diskvolume
Total     404.0      0.0        2.0      2.0      1200.0  2e+04      5e+05
Reserved  0.0        0.0        0.0      0.0      0.0     0.0        0.0

          processesroot  ipmi  powerconsumption  ambienttemp  cputemp
Total     396.0          -1.0  -1.0              -1.0         -1.0
Reserved  0.0            0.0   0.0               0.0          0.0

          aa_r  aa_r_dy  aa_dy_p  aa_r_ad  aa_r_hpc  fluentall  fluent  fluent_nox
Total     17.0  25.0     128.0    10.0     272.0     48.0       48.0    50.0
Reserved  0.0   0.0      0.0      0.0      0.0       0.0        0.0     0.0

          gambit  geom_trans  tgrid  fluent_par
Total     50.0    50.0        50.0   193.0
Reserved  0.0     0.0         0.0    0.0
3. bhosts -X
Displays condensed host groups in an uncondensed format:
$ bhosts -X
HOST_NAME  STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
comp027    ok      -     16   0      0    0      0      0
comp028    ok      -     16   0      0    0      0      0
comp029    ok      -     16   0      0    0      0      0
comp030    ok      -     16   0      0    0      0      0
comp031    ok      -     16   0      0    0      0      0
comp032    ok      -     16   0      0    0      0      0
comp033    ok      -     16   0      0    0      0      0
4. bhosts -l hostID
Displays all information about a specific server host, such as the CPU factor and the load thresholds to start, suspend, and resume jobs:
# bhosts -l comp067
HOST  comp067
STATUS  CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOW
ok     60.00     -   16      0    0      0      0    0  -

CURRENT LOAD USED FOR SCHEDULING:
          r15s  r1m  r15m  ut  pg   io  ls  it     tmp   swp  mem  root   maxroot
Total      0.0  0.0   0.0  0%  0.0   0   0  13032  324G  16G  60G  3e+05  4e+05
Reserved   0.0  0.0   0.0  0%  0.0   0   0  0      0M    0M   0M   0.0    0.0

          processes  clockskew  netcard  iptotal  cpuhz   cachesize  diskvolume
Total     406.0      0.0        2.0      2.0      1200.0  2e+04      5e+05
Reserved  0.0        0.0        0.0      0.0      0.0     0.0        0.0

          processesroot  ipmi  powerconsumption  ambienttemp  cputemp
Total     399.0          -1.0  -1.0              -1.0         -1.0
Reserved  0.0            0.0   0.0               0.0          0.0

          aa_r  aa_r_dy  aa_dy_p  aa_r_ad  aa_r_hpc  fluentall  fluent  fluent_nox
Total     18.0  25.0     128.0    10.0     272.0     47.0       47.0    50.0
Reserved  0.0   0.0      0.0      0.0      0.0       0.0        0.0     0.0

          gambit  geom_trans  tgrid  fluent_par
Total     50.0    50.0        50.0   193.0
Reserved  0.0     0.0         0.0    0.0

LOAD THRESHOLD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched     -    -     -   -   -   -   -   -    -    -    -
loadStop      -    -     -   -   -   -   -   -    -    -    -

           root  maxroot  processes  clockskew  netcard  iptotal  cpuhz  cachesize
loadSched     -        -          -          -        -        -      -          -
loadStop      -        -          -          -        -        -      -          -

           diskvolume  processesroot  ipmi  powerconsumption  ambienttemp  cputemp
loadSched           -              -     -                 -            -        -
loadStop            -              -     -                 -            -        -
5. lsload
[user1@login1 ~]$ lsload
HOST_NAME  status  r15s  r1m  r15m  ut  pg   ls  it     tmp   swp  mem
login1     ok      0.0   0.0  0.0   1%  0.0  17  0      240G  16G  28G
login2     ok      0.0   0.0  0.0   0%  0.0  0   7040   242G  16G  28G
node1      ok      0.0   0.4  0.3   0%  0.0  0   31760  324G  16G  60G
Displays the current state of the host:
| STATUS | DESCRIPTION |
| --- | --- |
| ok | Host is available to accept and run batch jobs and remote tasks. |
| -ok | LIM is running but RES is unreachable. |
| busy | Does not affect batch jobs, only used for remote task placement (i.e., lsrun). The value of a load index exceeded a threshold (configured in lsf.cluster.cluster_name, displayed by lshosts -l). Indices that exceed thresholds are identified with an asterisk (*). |
| lockW | Does not affect batch jobs, only used for remote task placement (i.e., lsrun). Host is locked by a run window (configured in lsf.cluster.cluster_name, displayed by lshosts -l). |
| lockU | Will not accept new batch jobs or remote tasks. An LSF administrator or root explicitly locked the host using lsadmin limlock, or an exclusive batch job (bsub -x) is running on the host. Running jobs are not affected. Use lsadmin limunlock to unlock LIM on the local host. |
| unavail | Host is down, or LIM is unavailable. |
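For example, the lockU state described above can be toggled by an administrator on the host itself (a quick sketch):

# lsadmin limlock      # lsload now reports the host as lockU
# lsadmin limunlock    # release the lock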
6. lshosts -l
The lshosts command shows the load thresholds.
$ lshosts -l
HOST_NAME: comp001
type    model        cpuf  ncpus  ndisks  maxmem  maxswp  maxtmp   rexpri  server  nprocs  ncores  nthreads
X86_64  Intel_EM64T  60.0  16     1       63G     16G     352423M  0       Yes     2       8       1

RESOURCES: Not defined
RUN_WINDOWS: (always open)

LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem  root  maxroot
-     3.5  -     -   -   -   -   -   -    -    -    -     -

processes  clockskew  netcard  iptotal  cpuhz  cachesize  diskvolume
-          -          -        -        -      -          -

processesroot  ipmi  powerconsumption  ambienttemp  cputemp
-              -     -                 -            -
This is a good summary, taken from Adaptive Computing "4.3 Job Management Commands", of the commands used to manage jobs under MAUI.
| Command | Flags | Description |
| --- | --- | --- |
| canceljob | | cancel existing job |
| checkjob | | display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization |
| diagnose | -j | display summarized job information and any unexpected state |
| releasehold | [-a] | remove job holds or defers |
| runjob | | start job immediately if possible |
| sethold | | set hold on job |
| setqos | | set/modify QoS of existing job |
| setspri | | adjust job/system priority of job |
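A typical sequence using these commands might look like the following (4444 is a hypothetical job ID):

$ checkjob 4444          # inspect the job state and why it is not running
# sethold 4444           # place a hold on the job
# releasehold -a 4444    # release all holds on the job
# runjob 4444            # start the job immediately if resources allow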
If the compute node's pbs_mom is lost and cannot be recovered (due to hardware or network failure), you can purge a running job from the qstat or showq output as follows:
1. Shutdown the pbs_server daemon on the PBS Server
# service pbs_server stop
2. Remove the job spool files that hold the hung job ID (for example, 4444)
# rm /var/spool/torque/server_priv/jobs/4444.headnode.SC
# rm /var/spool/torque/server_priv/jobs/4444.headnode.JB
3. Start the pbs_server daemon
# service pbs_server start
4. Restart the MAUI Daemon
# service maui restart
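Once the daemons are back, a quick check that the job is gone (4444 is the example job ID from above):

# qstat -an | grep 4444    # should return nothing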
Do take a look at the Torque Admin Manual
Step 1: Download the Torque Software from Adaptive Computing
Download the Torque tarball from Torque Resource Manager Site
Step 2: Ensure you have the gcc, openssl-devel, and libxml2-devel packages
# yum install libxml2-devel openssl-devel gcc gcc-c++
Step 3: Configure the Torque Server
./configure \
    --prefix=/opt/torque \
    --exec-prefix=/opt/torque/x86_64 \
    --enable-docs \
    --disable-gui \
    --with-server-home=/var/spool/torque \
    --enable-syslog \
    --with-scp \
    --disable-rpp \
    --disable-spool \
    --enable-gcc-warnings \
    --with-pam
Step 4: Compile Torque
# make -j8
# make install
Step 5: Configure the trqauthd daemon to start automatically at system boot for the PBS Server
# cp contrib/init.d/trqauthd /etc/init.d/
# chkconfig --add trqauthd
# echo /opt/torque/x86_64/lib > /etc/ld.so.conf.d/torque.conf
# ldconfig
# service trqauthd start
Step 6a: Copy the pbs_server and pbs_sched init scripts for the PBS Server
# cp contrib/init.d/pbs_server /etc/init.d/pbs_server
# cp contrib/init.d/pbs_sched /etc/init.d/pbs_sched
Step 6b: Initialize serverdb by executing the torque.setup script for the PBS Server
# ./torque.setup root
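To verify the initial configuration that torque.setup created, printing the server settings should show the default batch queue and the root operator/manager:

# qmgr -c 'print server'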
Step 7a: Make self-extracting tarball packages for the Client Nodes
# make packages
Building ./torque-package-clients-linux-i686.sh ...
Building ./torque-package-mom-linux-i686.sh ...
Building ./torque-package-server-linux-i686.sh ...
Building ./torque-package-gui-linux-i686.sh ...
Building ./torque-package-devel-linux-i686.sh ...
Done
Step 7b: Run libtool --finish /opt/torque/x86_64/lib
libtool: finish: PATH="/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/lib64/qt-3.3/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/bin/intel64:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/mpirt/bin/intel64:/opt/maui/bin:/opt/torque/x86_64/bin:/root/bin:/sbin" ldconfig -n /opt/torque/x86_64/lib
----------------------------------------------------------------------
Libraries have been installed in:
   /opt/torque/x86_64/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `LD_LIBRARY_PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-Wl,-rpath -Wl,LIBDIR' linker flag
   - have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
Step 8a: Copy and install on the Client Nodes
for i in node01 node02 node03 node04 ; do scp torque-package-mom-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do scp torque-package-clients-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-mom-linux-i686.sh --install ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-clients-linux-i686.sh --install ; done
Step 8b: Alternatively, you can use xCAT to push and run the packages from the PBS Server to the Client Nodes (assuming you have installed xCAT on the PBS Server)
# pscp torque-package-mom-linux-i686.sh compute_noderange:/tmp
# pscp torque-package-clients-linux-i686.sh compute_noderange:/tmp
# psh compute_noderange /tmp/torque-package-mom-linux-i686.sh --install
# psh compute_noderange /tmp/torque-package-clients-linux-i686.sh --install
Step 9: Enabling Torque as a service for the Client Node
# cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom
# chkconfig --add pbs_mom
Step 10a: Start the Services for each of the client nodes
# service pbs_mom start
Step 10b: Alternatively, use xCAT to start the service on all the Client Nodes
# psh compute_noderange "/sbin/service pbs_mom start"
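Finally, from the PBS Server you can confirm that every client node has checked in; pbsnodes should report them as free:

# pbsnodes -a    # each node should show state = free
# pbsnodes -l    # lists only down/offline nodes; empty output is good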