Configuring Submission Node for Torque 4.2.7

If you are planning to have more nodes where users can submit jobs apart from the Head Node of the cluster, you may want to configure a Submission Node. By default, TORQUE only allows submission from one node. There are two ways to configure a submission node; one of them is to use the submit_hosts parameter on the Torque Server.

 

Step 1a: Configuring the Submission Node

First and foremost, one of the main prerequisites is that the submission node must be part of the resource pool identified by the Torque Server. If it is not, follow the steps to make the to-be-submission node part of the resource pool as a pbs_mom client. You can check the setup by looking at Installing Torque 4.2.5 on CentOS 6 (Configuring the TORQUE Clients). You might also want to follow up with the optional setup Adding and Specifying Compute Resources at Torque to make sure your core counts are correct.

Step 1b: Ensure the exchange keys between submission node and Torque Server

For more information, see Auto SSH Login without Password

Step 1c: Assign the submission node to a non-default queue (Optional)

For more information, see Using Torque to set up a Queue to direct users to a subset of resources

Step 2: Registering the Submission Node in Torque

If you do not wish the submission node to be used as a compute resource, you can place it in a non-default or unique queue which users will not submit jobs to.

Once you have configured the to-be-submission node as one of the clients, configure the Torque Server with these commands:

# qmgr -c 'set server submit_hosts = hostname1'
# qmgr -c 'set server allow_node_submit = True'
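If there is more than one submission node, the submit_hosts attribute can hold a list of hosts; a hedged sketch, assuming a second host named hostname2 (a placeholder), plus a quick check of the resulting server settings:

# qmgr -c 'set server submit_hosts += hostname2'
# qmgr -c 'print server' | grep submit_hosts

print server echoes the qmgr commands needed to recreate the current configuration, so the grep shows every submission host currently allowed.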

Step 3: Putting Submission Node inside Torque Server /etc/hosts.equiv

# vim /etc/hosts.equiv
submission_node.cluster.com

Step 4a: Copy trqauthd from primary submission node to the secondary submission node

# scp -v /etc/init.d/trqauthd root@submission_node.cluster.com:/etc/init.d

Step 4b: Start the trqauthd service on the submission node

# service trqauthd start
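So that job submission keeps working after a reboot, you may also want to enable the service at boot on the submission node; a small sketch, assuming the standard chkconfig tooling on CentOS:

# chkconfig --add trqauthd
# chkconfig trqauthd on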

Step 5: Test the configuration

From the submission node, do a

$ qsub -I -l nodes=1:ppn=8

You should see from the torque server that the job has been submitted via the submission node by doing a qstat -an

$ qstat -an

Step 6: Mount Maui Information from PBS/MAUI Server

From the MAUI Server, export the MAUI configuration and binaries via NFS.

Edit /etc/exports

/opt/maui               Submission-Node1(rw,no_root_squash,async,no_subtree_check) 
/usr/local/maui         Submission-Node1(rw,no_root_squash,async,no_subtree_check)

At the MAUI Server, restart NFS Services

# service nfs restart
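Alternatively, if NFS is already running on the MAUI Server, you can simply re-read /etc/exports without a full restart; a quick sketch:

# exportfs -ra
# exportfs -v

exportfs -ra re-exports all the entries in /etc/exports, and -v lists what is currently being exported.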

At the submission node, make sure you have the mount points /opt/maui and /usr/local/maui for the NFS mounts.

In /etc/fstab, add the entries below, then mount the file systems and restart netfs:

head-node1:/usr/local/maui    /usr/local/maui        nfs      defaults  0 0
head-node1:/opt/maui          /opt/maui              nfs      defaults  0 0
# service netfs restart
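To confirm that the MAUI directories are mounted on the submission node, a quick check:

# mount | grep maui
# df -h /opt/maui /usr/local/maui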

Resources:

  1. Torque Server document torqueAdminGuide-4.2.7
  2. Bad UID for job execution MSG=ruserok failed validating user1 from ServerNode while configuring Submission Node in Torque
  3. Unable to Submit via Torque Submission Node – Socket_Connect Error for Torque 4.2.7

Using Torque to set up a Queue to direct users to a subset of resources

If you are running clusters, you may want to set up a queue to direct users to a subset of resources with Torque. For example, I may wish to direct users who need specific resources like MATLAB to a particular queue.

More information can be found in the Torque Documentation 4.1, “4.1.4 Mapping a Queue to a Subset of Resources”:


…The simplest method is using default_resources.neednodes on an execution queue, setting it to a particular node attribute. Maui/Moab will use this information to ensure that jobs in that queue will be assigned nodes with that attribute…

For example, if you are creating a queue for users of MATLAB

qmgr -c "create queue matlab"
qmgr -c "set queue matlab queue_type = Execution"
qmgr -c "set queue matlab resources_default.neednodes = matlab"
qmgr -c "set queue matlab enabled = True"
qmgr -c "set queue matlab started = True"

For those nodes you are assigning to the queue, do update the node properties. A good example can be found at 3.2 Node Properties.

To add new properties on-the-fly,

qmgr -c "set node node001 properties += matlab"

(if you are adding additional properties to the nodes)

To remove properties on-the-fly

qmgr -c "set node node001 properties -= matlab"

Platform LSF – Controlling Hosts

1. Closing a Host

# badmin hclose hostid
Close hostid ...... done

2. Opening a Host

# badmin hopen hostid
Open hostid ...... done

3. Log a comment when closing or opening a host

# badmin hopen -C "Re-Provisioned" hostA
# badmin hclose -C "Weekly backup" hostB

The comment text Weekly backup is recorded in lsb.events. If you close or open a host group, each host group member displays with the same comment string.
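To verify the host state and see why it is closed, you can look at the long host output; a quick check, assuming hostB from the example above (the administrator comment is also shown there in recent LSF versions):

# bhosts -l hostB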

Platform LSF – Working with Hosts (lshosts, lsmon)

The lshosts -l command shows the host configuration, including the load thresholds:

$ lshosts -l
HOST_NAME:  comp001
type             model  cpuf ncpus ndisks maxmem maxswp maxtmp rexpri server nprocs ncores nthreads
X86_64     Intel_EM64T  60.0    16      1    63G    16G 352423M      0    Yes      2      8        1

RESOURCES: Not defined
RUN_WINDOWS:  (always open)

LOAD_THRESHOLDS:
r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem   root maxroot processes clockskew netcard iptotal  cpuhz cachesize diskvolume processesroot   ipmi powerconsumption ambienttemp cputemp
-   3.5     -    -     -     -    -    -     -     -     -      -       -         -         -       -       -      -         -          -             -      -                -           -       -

Platform LSF – Working with Hosts (bhost, lsload, lsmon)

Host status

Host status describes the ability of a host to accept and run batch jobs in terms of daemon states, load levels, and administrative controls. The bhosts and lsload commands display host status.

 

1. bhosts
Displays the current status of the host

STATUS DESCRIPTION
ok  Host is available to accept and run new batch jobs
unavail  Host is down, or LIM and sbatchd are unreachable.
unreach  LIM is running but sbatchd is unreachable.
closed  Host will not accept new jobs. Use bhosts -l to display the reasons.
unlicensed Host does not have a valid license.

 

2. bhosts -l
Displays the closed reasons. A closed host does not accept new batch jobs:

$ bhosts -l
HOST  node001
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
closed_Adm      60.00     -     16      0      0      0      0      0      -

CURRENT LOAD USED FOR SCHEDULING:
r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem   root maxroot
Total           0.0   0.0   0.0    0%   0.0     0    0 28656  324G   16G   60G  3e+05   4e+05
Reserved        0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M    0.0     0.0

processes clockskew netcard iptotal  cpuhz cachesize diskvolume
Total             404.0       0.0     2.0     2.0 1200.0     2e+04      5e+05
Reserved            0.0       0.0     0.0     0.0    0.0       0.0        0.0

processesroot   ipmi powerconsumption ambienttemp cputemp
Total                 396.0   -1.0             -1.0        -1.0    -1.0
Reserved                0.0    0.0              0.0         0.0     0.0


aa_r aa_r_dy aa_dy_p aa_r_ad aa_r_hpc fluentall fluent fluent_nox
Total         17.0    25.0   128.0    10.0    272.0      48.0   48.0       50.0
Reserved       0.0     0.0     0.0     0.0      0.0       0.0    0.0        0.0

gambit geom_trans tgrid fluent_par
Total           50.0       50.0  50.0      193.0
Reserved         0.0        0.0   0.0        0.0

 

3. bhosts -X

Displays host groups in an uncondensed format, listing each member host on its own line

$ bhosts -X
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
comp027            ok              -     16      0      0      0      0      0
comp028            ok              -     16      0      0      0      0      0
comp029            ok              -     16      0      0      0      0      0
comp030            ok              -     16      0      0      0      0      0
comp031            ok              -     16      0      0      0      0      0
comp032            ok              -     16      0      0      0      0      0
comp033            ok              -     16      0      0      0      0      0

 

4. bhosts -l hostID

Displays all information about a specific server host, such as the CPU factor and the load thresholds used to start, suspend, and resume jobs:

# bhosts -l comp067
HOST  comp067
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
ok              60.00     -     16      0      0      0      0      0      -

CURRENT LOAD USED FOR SCHEDULING:
r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem   root maxroot
Total           0.0   0.0   0.0    0%   0.0     0    0 13032  324G   16G   60G  3e+05   4e+05
Reserved        0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M    0.0     0.0

processes clockskew netcard iptotal  cpuhz cachesize diskvolume
Total             406.0       0.0     2.0     2.0 1200.0     2e+04      5e+05
Reserved            0.0       0.0     0.0     0.0    0.0       0.0        0.0

processesroot   ipmi powerconsumption ambienttemp cputemp
Total                 399.0   -1.0             -1.0        -1.0    -1.0
Reserved                0.0    0.0              0.0         0.0     0.0

aa_r aa_r_dy aa_dy_p aa_r_ad aa_r_hpc fluentall fluent fluent_nox
Total         18.0    25.0   128.0    10.0    272.0      47.0   47.0       50.0
Reserved       0.0     0.0     0.0     0.0      0.0       0.0    0.0        0.0

gambit geom_trans tgrid fluent_par
Total           50.0       50.0  50.0      193.0
Reserved         0.0        0.0   0.0        0.0

LOAD THRESHOLD USED FOR SCHEDULING:
r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
loadSched   -     -     -     -       -     -    -     -     -      -      -
loadStop    -     -     -     -       -     -    -     -     -      -      -

root maxroot processes clockskew netcard iptotal   cpuhz cachesize
loadSched     -       -         -         -       -       -       -         -
loadStop      -       -         -         -       -       -       -         -

diskvolume processesroot    ipmi powerconsumption ambienttemp cputemp
loadSched        -             -       -                -           -       -
loadStop         -             -       -                -           -       -

 

5. lsload

[user1@login1 ~]$ lsload
HOST_NAME       status  r15s   r1m  r15m   ut    pg  ls    it   tmp   swp   mem
login1          ok   0.0   0.0   0.0   1%   0.0  17     0  240G   16G   28G
login2          ok   0.0   0.0   0.0   0%   0.0   0  7040  242G   16G   28G
node1           ok   0.0   0.4   0.3   0%   0.0   0 31760  324G   16G   60G

Displays the current state of the host:

STATUS DESCRIPTION
ok Host is available to accept and run batch jobs and remote tasks.
-ok LIM is running but RES is unreachable.
busy Does not affect batch jobs, only used for remote task placement (i.e., lsrun). The value of a load index exceeded a threshold (configured in lsf.cluster.cluster_name, displayed by lshosts -l). Indices that exceed thresholds are identified with an asterisk (*).
lockW Does not affect batch jobs, only used for remote task placement (i.e., lsrun). Host is locked by a run window (configured in lsf.cluster.cluster_name, displayed by lshosts -l).
lockU Will not accept new batch jobs or remote tasks. An LSF administrator or root explicitly locked the host using lsadmin limlock, or an exclusive batch job (bsub -x) is running on the host. Running jobs are not affected. Use lsadmin limunlock to unlock LIM on the local host.
unavail Host is down, or LIM is unavailable.

 

6. lshosts -l
The lshosts command shows the load thresholds.

$ lshosts -l
HOST_NAME:  comp001
type             model  cpuf ncpus ndisks maxmem maxswp maxtmp rexpri server nprocs ncores nthreads
X86_64     Intel_EM64T  60.0    16      1    63G    16G 352423M      0    Yes      2      8        1

RESOURCES: Not defined
RUN_WINDOWS:  (always open)

LOAD_THRESHOLDS:
r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem   root maxroot processes clockskew netcard iptotal  cpuhz cachesize diskvolume processesroot   ipmi powerconsumption ambienttemp cputemp
-   3.5     -    -     -     -    -    -     -     -     -      -       -         -         -       -       -      -         -          -             -      -                -           -       -

 

7. References:

  1. Platform – Working with hosts

Summary of Job Management Commands for MAUI

This is a good summary, taken from Adaptive Computing 4.3 Job Management Commands, of the commands used to manage jobs in MAUI.

Command       Flags   Description
canceljob             cancel existing job
checkjob              display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
diagnose      -j      display summarized job information and any unexpected state
releasehold   [-a]    remove job holds or defers
runjob                start job immediately if possible
sethold               set hold on job
setqos                set/modify QoS of existing job
setspri               adjust job/system priority of job
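As a hedged illustration of how these commands are typically combined, assuming a job with ID 1234 (a placeholder):

$ checkjob 1234
# sethold 1234
# releasehold -a 1234
# setspri 100 1234
# canceljob 1234

checkjob shows the job's current state and resource usage, sethold and releasehold -a place and clear holds, setspri 100 1234 gives the job a system priority of 100, and canceljob removes it from the queue.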

Deleting PBS and MAUI Jobs which cannot be purged

If the compute node's pbs_mom is lost and cannot be recovered (due to hardware or network failure) and you need to purge a running job from the qstat or showq output, do the following:

1. Shutdown the pbs_server daemon on the PBS Server

# service pbs_server stop

2. Remove the Job Spool Files that hold the hung Job ID (for example, 4444)

# rm /var/spool/torque/server_priv/jobs/4444.headnode.SC
# rm /var/spool/torque/server_priv/jobs/4444.headnode.JB

3. Start the pbs_server Daemon

# service pbs_server start

4. Restart the MAUI Daemon

# service maui restart
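After the restart, you can confirm that the stale job is gone; a quick check, assuming job 4444 from the example above:

# qstat -an | grep 4444
# showq

The grep should return nothing, and the job should no longer appear in the showq output.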

References:

  1. Deleting PBS/Maui Jobs

Installing Torque 4.2.5 on CentOS 6

References:

Do take a look at the Torque Admin Manual

Step 1: Download the Torque Software from Adaptive Computing

Download the Torque tarball from Torque Resource Manager Site

Step 2: Ensure you have the gcc, openssl-devel, and libxml2-devel packages

# yum install libxml2-devel openssl-devel gcc gcc-c++

Step 3: Configure the Torque Server

./configure \
--prefix=/opt/torque \
--exec-prefix=/opt/torque/x86_64 \
--enable-docs \
--disable-gui \
--with-server-home=/var/spool/torque \
--enable-syslog \
--with-scp \
--disable-rpp \
--disable-spool \
--enable-gcc-warnings \
--with-pam

Step 4: Compile the Torque

# make -j8
# make install

Step 5: Configure the trqauthd daemon to start automatically at system boot for the PBS Server

# cp contrib/init.d/trqauthd /etc/init.d/
# chkconfig --add trqauthd
# echo /opt/torque/x86_64/lib > /etc/ld.so.conf.d/torque.conf
# ldconfig
# service trqauthd start

Step 6a: Copy the pbs_server and pbs_sched daemon for the PBS Server

# cp contrib/init.d/pbs_server /etc/init.d/pbs_server
# cp contrib/init.d/pbs_sched /etc/init.d/pbs_sched
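If you want pbs_server and pbs_sched to come up automatically at boot as well, the same chkconfig pattern used for trqauthd can be applied; a small sketch:

# chkconfig --add pbs_server
# chkconfig --add pbs_sched
# chkconfig pbs_server on
# chkconfig pbs_sched on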

Step 6b: Initialize serverdb by executing the torque.setup script for the PBS Server

# ./torque.setup root
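Once torque.setup has completed and pbs_server is running, you can verify the resulting serverdb; a quick check:

# qmgr -c 'print server'

This prints the server attributes and the default batch queue that the setup script creates.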

Step 7: Make self-extracting tarballs packages for Client Nodes

# make packages
Building ./torque-package-clients-linux-i686.sh ...
Building ./torque-package-mom-linux-i686.sh ...
Building ./torque-package-server-linux-i686.sh ...
Building ./torque-package-gui-linux-i686.sh ...
Building ./torque-package-devel-linux-i686.sh ...
Done

Step 7b: Run libtool --finish /opt/torque/x86_64/lib

libtool: finish: PATH="/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/lib64/qt-3.3/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/bin/intel64:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/mpirt/bin/intel64:/opt/maui/bin:/opt/torque/x86_64/bin:/root/bin:/sbin" ldconfig -n /opt/torque/x86_64/lib
----------------------------------------------------------------------
Libraries have been installed in:
/opt/torque/x86_64/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

 

Step 8a: Copy and install on the Client Nodes

for i in node01 node02 node03 node04 ; do scp torque-package-mom-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do scp torque-package-clients-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-mom-linux-i686.sh --install ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-clients-linux-i686.sh --install ; done

Step 8b: Alternatively, you can use xCAT to push and run the packages from the PBS Server to the Client Nodes (assuming you have installed xCAT on the PBS Server)

# pscp  torque-package-mom-linux-i686.sh compute_noderange:/tmp
# pscp torque-package-clients-linux-i686.sh compute_noderange:/tmp
# psh compute_noderange /tmp/torque-package-mom-linux-i686.sh --install
# psh compute_noderange /tmp/torque-package-clients-linux-i686.sh --install

Step 9: Enabling Torque as a service for the Client Node

# cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom
# chkconfig --add pbs_mom
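Before starting pbs_mom, each client node needs to know which host runs pbs_server, and the server needs the client listed in its nodes file. A minimal sketch, assuming the server home /var/spool/torque from the configure step, a head node named headnode and a compute node node01 with 16 cores (all placeholders):

On each client node, /var/spool/torque/mom_priv/config:

$pbsserver      headnode
$logevent       255

On the PBS Server, /var/spool/torque/server_priv/nodes:

node01 np=16

$pbsserver tells the MOM where pbs_server runs, and $logevent 255 enables the common log events. Restart pbs_server after editing the nodes file so it re-reads the node list.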

Step 10a: Start the Services for each of the client nodes

# service pbs_mom start

Step 10b: Alternatively, use xCAT to start the service on all the Client Nodes

# psh compute_noderange "/sbin/service/pbs_mom start"
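From the PBS Server, you can then check that all the client nodes have registered with the server; a quick check:

# pbsnodes -a

Nodes that are up and idle should report state = free; a node shown as down usually means pbs_mom is not running or cannot reach the server.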