Configuring Submission Node for Torque 4.2.7

If you are planning to have nodes other than the Head Node of the cluster where users can submit jobs, you may want to configure a Submission Node. By default, TORQUE only allows submission from one node. There are two ways to configure a submission node; one way is to use the submit_hosts parameter on the Torque Server.

 

Step 1a: Configuring the Submission Node

First and foremost, one of the main prerequisites is that the submission node must be part of the resource pool known to the Torque Server. If it is not, follow the steps to make the to-be-submission node part of the resource pool, i.e. a pbs_mom client. You can check the setup by looking at Installing Torque 4.2.5 on CentOS 6, in particular the section on configuring the TORQUE clients. You might also want to follow up with the optional setup in Adding and Specifying Compute Resources at Torque to make sure your core counts are correct.
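Once the to-be-submission node has been added as a pbs_mom client, you can confirm from the Torque Server that it is in the resource pool by querying it with pbsnodes (substitute your own hostname):

# pbsnodes submission_node.cluster.com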

Step 1b: Ensure SSH keys have been exchanged between the submission node and the Torque Server

For more information, see Auto SSH Login without Password
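A minimal sketch of the key exchange, run as root on the Torque Server (the hostname is the same placeholder used elsewhere in this post; see the linked article for the full procedure):

# ssh-keygen -t rsa
# ssh-copy-id root@submission_node.cluster.com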

Step 1c: Configure the submission node as a non-default queue (Optional)

For more information, see Using Torque to set up a Queue to direct users to a subset of resources

Step 2: Registering the Submission Node in Torque

If you do not wish the submission node to actually serve as a compute resource, you can place it in a non-default or unique queue that users will not submit jobs to.

Once you have configured the to-be-submission node as one of the clients, you now have to configure the Torque Server with these commands:

# qmgr -c 'set server submit_hosts = hostname1'
# qmgr -c 'set server allow_node_submit = True'
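If you plan to have more than one submission node, additional hosts can be appended to the submit_hosts list (hostname2 is a placeholder):

# qmgr -c 'set server submit_hosts += hostname2'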

Step 3: Putting Submission Node inside Torque Server /etc/hosts.equiv

# vim /etc/hosts.equiv
submission_node.cluster.com

Step 4a: Copy the trqauthd init script from the primary submission node (the Torque Server) to the secondary submission node

# scp -v /etc/init.d/trqauthd root@submission_node.cluster.com:/etc/init.d

Step 4b: Start the trqauthd service on the submission node

# service trqauthd start

Step 5: Test the configuration

Submit an interactive test job from the new submission node:

$ qsub -I -l nodes=1:ppn=8

From the Torque Server, you should be able to see that the job was submitted via the submission node by running qstat -an:

$ qstat -an

Step 6: Mount Maui Information from PBS/MAUI Server

From the MAUI Server, export the MAUI configuration and binaries over NFS so that the submission node can mount them.

Edit /etc/exports

/opt/maui               Submission-Node1(rw,no_root_squash,async,no_subtree_check) 
/usr/local/maui         Submission-Node1(rw,no_root_squash,async,no_subtree_check)

At the MAUI Server, restart NFS Services

# service nfs restart

At the submission node, make sure the mount points /opt/maui and /usr/local/maui exist for the MAUI configuration and binaries.
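If the directories do not yet exist on the submission node, create them first:

# mkdir -p /opt/maui /usr/local/maui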

Add the mounts to /etc/fstab, then restart netfs to mount the file systems:

head-node1:/usr/local/maui    /usr/local/maui        nfs      defaults  0 0
head-node1:/opt/maui          /opt/maui              nfs      defaults  0 0
# service netfs restart

Resources:

  1. Torque Server document torqueAdminGuide-4.2.7
  2. Bad UID for job execution MSG=ruserok failed validating user1 from ServerNode while configuring Submission Node in Torque
  3. Unable to Submit via Torque Submission Node – Socket_Connect Error for Torque 4.2.7

Using Torque to set up a Queue to direct users to a subset of resources

If you are running a cluster, you may want to set up a queue to direct users to a subset of resources with Torque. For example, I may wish to direct users who need specific resources such as MATLAB to a particular queue.

More information can be found in the Torque 4.1 documentation, "4.1.4 Mapping a Queue to a Subset of Resources".


….The simplest method is using default_resources.neednodes on an execution queue, setting it to a particular node attribute. Maui/Moab will use this information to ensure that jobs in that queue will be assigned nodes with that attribute…… 

For example, if you are creating a queue for users of MATLAB

qmgr -c "create queue matlab"
qmgr -c "set queue matlab queue_type = Execution"
qmgr -c "set queue matlab resources_default.neednodes = matlab"
qmgr -c "set queue matlab enabled = True"
qmgr -c "set queue matlab started = True"

For the nodes you are assigning to the queue, do update the node properties. A good example can be found at 3.2 Node Properties.

To add new properties on-the-fly,

qmgr -c "set node node001 properties += matlab"

(if you are adding additional properties to the nodes)

To remove properties on-the-fly

qmgr -c "set node node001 properties -= matlab"

Deleting PBS and MAUI Jobs which cannot be purged

If a compute node's pbs_mom is lost and cannot be recovered (due to hardware or network failure), and you need to purge a running job from the qstat or showq output, do the following:

1. Shutdown the pbs_server daemon on the PBS Server

# service pbs_server stop

2. Remove the job spool files that hold the hung job ID (for example, 4444)

# rm /var/spool/torque/server_priv/jobs/4444.headnode.SC
# rm /var/spool/torque/server_priv/jobs/4444.headnode.JB

3. Start the pbs_server daemon

# service pbs_server start

4. Restart the MAUI Daemon

# service maui restart

References:

  1. Deleting PBS/Maui Jobs

Installing Torque 4.2.5 on CentOS 6

References:

Do take a look at the Torque Admin Manual

Step 1: Download the Torque Software from Adaptive Computing

Download the Torque tarball from Torque Resource Manager Site

Step 2: Ensure you have the gcc, gcc-c++, openssl-devel, and libxml2-devel packages

# yum install libxml2-devel openssl-devel gcc gcc-c++

Step 3: Configure the Torque Server

./configure \
--prefix=/opt/torque \
--exec-prefix=/opt/torque/x86_64 \
--enable-docs \
--disable-gui \
--with-server-home=/var/spool/torque \
--enable-syslog \
--with-scp \
--disable-rpp \
--disable-spool \
--enable-gcc-warnings \
--with-pam

Step 4: Compile Torque

# make -j8
# make install

Step 5: Configure the trqauthd daemon to start automatically at system boot for the PBS Server

# cp contrib/init.d/trqauthd /etc/init.d/
# chkconfig --add trqauthd
# echo /opt/torque/x86_64/lib > /etc/ld.so.conf.d/torque.conf
# ldconfig
# service trqauthd start

Step 6: Copy the pbs_server and pbs_sched init scripts for the PBS Server

# cp contrib/init.d/pbs_server /etc/init.d/pbs_server
# cp contrib/init.d/pbs_sched /etc/init.d/pbs_sched

Step 6b: Initialize serverdb by executing the torque.setup script for the PBS Server

# ./torque.setup root

Step 7: Make self-extracting tarballs packages for Client Nodes

# make packages
Building ./torque-package-clients-linux-i686.sh ...
Building ./torque-package-mom-linux-i686.sh ...
Building ./torque-package-server-linux-i686.sh ...
Building ./torque-package-gui-linux-i686.sh ...
Building ./torque-package-devel-linux-i686.sh ...
Done

Step 7b: Run libtool --finish /opt/torque/x86_64/lib

libtool: finish: PATH="/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/lib64/qt-3.3/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/bin/intel64:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/mpirt/bin/intel64:/opt/maui/bin:/opt/torque/x86_64/bin:/root/bin:/sbin" ldconfig -n /opt/torque/x86_64/lib
----------------------------------------------------------------------
Libraries have been installed in:
/opt/torque/x86_64/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

 

Step 8a: Copy and install on the Client Nodes

for i in node01 node02 node03 node04 ; do scp torque-package-mom-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do scp torque-package-clients-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-mom-linux-i686.sh --install ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-clients-linux-i686.sh --install ; done

Step 8b: Alternatively, you can use xCAT to push and run the packages from the PBS Server to the Client Nodes (assuming you have installed xCAT on the PBS Server)

# pscp  torque-package-mom-linux-i686.sh compute_noderange:/tmp
# pscp torque-package-clients-linux-i686.sh compute_noderange:/tmp
# psh compute_noderange /tmp/torque-package-mom-linux-i686.sh --install
# psh compute_noderange /tmp/torque-package-clients-linux-i686.sh --install

Step 9: Enabling Torque as a service for the Client Node

# cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom
# chkconfig --add pbs_mom

Step 10a: Start the service on each of the client nodes

# service pbs_mom start

Step 10b: Alternatively, use xCAT to start the service on all the Client Nodes

# psh compute_noderange "/sbin/service/pbs_mom start"

Adding and Specifying Compute Resources at Torque

This blog entry is a follow-up to Installing Torque 2.5 on CentOS 6 with the xCAT tool.

After installing Torque on the Head Node and Compute Nodes, the next thing to do is to configure the Torque Server. In this blog entry, I will focus on configuring the compute resources at the Torque Server.

 Step 1: Adding Nodes to the Torque Server

# qmgr -c "create node node01"

Step 2: Configure automatic detection of node CPU counts. Setting auto_node_np to TRUE overrides the value of np set in $TORQUEHOME/server_priv/nodes

# qmgr -c "set server auto_node_np = True"

Step 3: Start pbs_mom on the compute nodes; the Torque Server will then detect the nodes automatically

# service pbs_mom start

Using pam_pbssimpleauth.so to authorise user logins for Torque

For a cluster shared by many users, it is important to prevent errant users from directly SSHing into the compute nodes and bypassing the scheduler. To obtain the PAM module, compile the Torque Server as described in Installing Torque 2.5 on CentOS 6.

Step 1: You should be able to find the pam_pbssimpleauth files at

$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.a
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.la
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so

Step 2: Copy pam_pbssimpleauth.so to the compute nodes. Do not install pam_pbssimpleauth.so on the Head Node.

# scp $TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so node1:/lib64/security/

Step 3: Verify that pam_access.so is also present in the /lib64/security/ directory

# ls /lib64/security/pam_access.so

Step 4: Add the access.so and pam_pbssimpleauth.so in the PAM configuration files

# vim /etc/pam.d/sshd
auth       required     pam_sepermit.so
auth       include      password-auth
account    required     pam_nologin.so

account    required     pam_pbssimpleauth.so
account    required     pam_access.so

account    include      password-auth
password   include      password-auth
.....
.....

When a user SSHes to a node, this module checks the .JB files in $PBS_SERVER_HOME/mom_priv/jobs/ for a matching UID and verifies that the corresponding job is running.

You can then test the configuration as sketched below.
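A rough test, assuming node01 is a compute node with the module enabled: as a regular user with no running job on node01, SSH to it and the login should be refused; then start an interactive job on that node and, from another terminal, the SSH login should succeed while the job runs.

$ ssh node01
$ qsub -I -l nodes=node01:ppn=1
$ ssh node01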

PBS scripts for mpirun parameters for Chelsio / Infiniband Cards

If you are running Chelsio cards, you may want to specify the mpirun parameters explicitly to ensure that your MPI jobs use the intended interconnect and process binding:

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun \
    -mca btl openib,sm,self --bind-to-core \
    --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file

--bind-to-core: bind each MPI process to a core
--mca btl openib,sm,self: use the openib (InfiniBand), shared memory (sm), and loopback (self) transports
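Putting it together, a minimal submission script built around this command might look like the sketch below. The job name, queue, resource request, and walltime are placeholders; NCPUS is derived here from $PBS_NODEFILE since Torque does not set it by itself, and $file is assumed to be passed in at submission time with qsub -v file=mybinaryfile.

#!/bin/bash
#PBS -N chelsio_mpi
#PBS -q clusterqueue
#PBS -l nodes=2:ppn=8
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR

# $PBS_NODEFILE contains one line per allocated core
NCPUS=$(wc -l < $PBS_NODEFILE)

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun \
    -mca btl openib,sm,self --bind-to-core \
    --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file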

For information on Interprocess communication with shared memory,

  1. see Speaking UNIX: Interprocess communication with shared memory

Sequential execution of parallel or serial jobs on OpenPBS / Torque

If you have a requirement to execute jobs in sequence, so that the 2nd job can only launch after the 1st job has completed, you can use the -W depend option.

For example, if you have a running job with job ID 12345 and you want the next job to run only after job 12345 has completed:

$ qsub -q clusterqueue -l nodes=1:ppn=8 -W depend=afterany:12345 parallel.sh -v file=mybinaryfile

You will notice that the second job is held until the first job has finished.

.....
.....
24328 kittycool Hold 2 10:00:00:00 Sat Mar 2 02:20:12
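Since qsub prints the ID of the job it creates, the dependency can also be scripted instead of typing the job ID by hand (first.sh and second.sh are placeholder scripts):

FIRST_JOB=$(qsub -q clusterqueue -l nodes=1:ppn=8 first.sh)
qsub -q clusterqueue -l nodes=1:ppn=8 -W depend=afterany:$FIRST_JOB second.sh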

Handling input files on PBS

Single Input File (Serial run)

If you need to run your job with a single input file, add the following lines to your PBS submission script. Suppose the single input file is data1.ini; the program will take data1.ini as input and generate an output file data1.out. In your submission script:

#!/bin/bash
.....
.....
## Use data1.ini as the input file for the program
cd $PBS_O_WORKDIR
mybinaryprogram < data1.ini 1> data1.out 2> data1.err
.....
.....

Multiple Input Files (Serial run)

If you wish to use multiple input files, you should use PBS job arrays. These are expressed with the -t parameter, which allows many copies of the same script to be queued at once. You can use the PBS_ARRAYID variable to differentiate between the different jobs in the array.

Assuming the data files:

data1.ini
data2.ini
data3.ini
data4.ini
data5.ini
data6.ini
data7.ini
data8.ini

the corresponding job array script would be:

#!/bin/bash
.....
.....
#PBS -t 1-8
cd $PBS_O_WORKDIR
mybinaryprogram < data${PBS_ARRAYID}.ini 1> data${PBS_ARRAYID}.out 2> data${PBS_ARRAYID}.err
.....
.....
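The array range can also be given at submission time instead of via the #PBS -t directive in the script (jobarray.sh is a placeholder name for the script above):

$ qsub -t 1-8 jobarray.sh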

The above information is obtained from:

  1. Michigan State University HPCC “Advanced Scripting Using PBS Environment Variables”