Running multiple copies of the same job at the same time on PBS

In some situations, you may have to run the same job several times, for example when sampling repeatedly with different random number seeds.

Single Copy

#!/bin/bash
.....
.....
## Use data1.ini as the input file for $file
cd $PBS_O_WORKDIR
mybinaryprogram 1> mybinaryprogram.out 2> mybinaryprogram.err
.....
.....

Multiple Copies

If we qsub the job more than once, the output will overwrite the results from the previous jobs. You can use the PBS environment variable PBS_JOBID to create a per-job directory and redirect your output into it.

#!/bin/bash
.....
.....
cd $PBS_O_WORKDIR
mkdir $PBS_JOBID
mybinaryprogram 1> $PBS_JOBID/mybinaryprogram.out 2> $PBS_JOBID/mybinaryprogram.err
.....
.....

This should prevent the output from overwriting itself.
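The pattern above can be tried outside PBS by faking the variables the scheduler would normally set (PBS_JOBID and PBS_O_WORKDIR are supplied by Torque inside a real job; the values below are stand-ins for illustration):

```shell
# PBS_JOBID and PBS_O_WORKDIR are normally set by Torque; fake them
# here so the per-job redirect pattern can be tested on the command line.
PBS_O_WORKDIR=$(mktemp -d)
PBS_JOBID=12345.headnode

cd "$PBS_O_WORKDIR"
mkdir -p "$PBS_JOBID"                  # one output directory per job ID
echo "result" 1> "$PBS_JOBID/mybinaryprogram.out" \
              2> "$PBS_JOBID/mybinaryprogram.err"
```

Submitting the job twice produces two distinct directories, so nothing is overwritten.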

The above information is obtained from:

  1. Michigan State University HPCC “Advanced Scripting Using PBS Environment Variables”

Sample PBS Scripts for R

Here is a sample PBS script that can be used for R. This is just a suggestion; modify and comment at will. The script below is named R.sh.

#!/bin/bash
#PBS -N R-job
#PBS -j oe
#PBS -V
#PBS -m bea
#PBS -M myemail@hotmail.com
#PBS -l nodes=1:ppn=8

# comment these out if you wish
echo "qsub host = " $PBS_O_HOST
echo "original queue = " $PBS_O_QUEUE
echo "qsub working directory absolute = " $PBS_O_WORKDIR
echo "pbs environment = " $PBS_ENVIRONMENT
echo "pbs batch = " $PBS_JOBID
echo "pbs job name from me = " $PBS_JOBNAME
echo "Name of file containing nodes = " $PBS_NODEFILE
echo "contents of nodefile = " `cat $PBS_NODEFILE`
echo "Name of queue to which job went = " $PBS_QUEUE

# Pre-processing script
cd $PBS_O_WORKDIR
NCPUS=`cat $PBS_NODEFILE | wc -l`
echo "Number of requested processors = " $NCPUS

# Load R Module
module load mpi/intel_1.4.3
module load intel/12.0.2
module load R/R-2.15.1

# ###############
# Execute Program
# ################
/usr/local/R-2.15.1/bin/R CMD BATCH $file

The corresponding qsub command and its parameters should be something like (note that qsub options must come before the script name):

$ qsub -q dqueue -l nodes=1:ppn=8 -v file=Rjob.r R.sh

Sample PBS Scripts for MATLAB

Here is a sample PBS script that can be used for MATLAB. This is just a suggestion; modify and comment at will. The script below is named matlab_serial.sh.

#!/bin/bash
#PBS -N MATLAB_Serial
#PBS -j oe
#PBS -V
#PBS -m bea
#PBS -M myemail@hotmail.com
#PBS -l nodes=1:ppn=1

# comment these out if you wish
echo "qsub host = " $PBS_O_HOST
echo "original queue = " $PBS_O_QUEUE
echo "qsub working directory absolute = " $PBS_O_WORKDIR
echo "pbs environment = " $PBS_ENVIRONMENT
echo "pbs batch = " $PBS_JOBID
echo "pbs job name from me = " $PBS_JOBNAME
echo "Name of file containing nodes = " $PBS_NODEFILE
echo "contents of nodefile = " `cat $PBS_NODEFILE`
echo "Name of queue to which job went = " $PBS_QUEUE

## pre-processing script
cd $PBS_O_WORKDIR
NCPUS=`cat $PBS_NODEFILE | wc -l`
echo "Number of requested processors = " $NCPUS

# Load MATLAB Module
module load intel/12.0.2
module load matlab/R2011b

cd $PBS_O_WORKDIR
# MATLAB's -r option expects the script name without the .m extension
/usr/local/MATLAB/R2011b/bin/matlab -nodisplay -r ${file%.m}

The corresponding qsub command and its parameters should be something like (the resource request should match the #PBS -l line in the script, and options must come before the script name):

$ qsub -q dqueue -l nodes=1:ppn=1 -v file=yourmatlabfile.m matlab_serial.sh
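One gotcha worth noting: MATLAB's -r option takes a command, not a file name, so the .m extension has to be stripped before the call; shell parameter expansion handles this cleanly (the file name below is just an example):

```shell
file=yourmatlabfile.m        # as delivered by: qsub -v file=yourmatlabfile.m
script=${file%.m}            # strip the trailing .m for matlab -r
echo "matlab -nodisplay -r $script"
```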

Configuring the Torque Default Queue

Here is a sample Torque queue configuration:

qmgr -c "create queue dqueue"
qmgr -c "set queue dqueue queue_type = Execution"
qmgr -c "set queue dqueue resources_default.neednodes = dqueue"
qmgr -c "set queue dqueue enabled = True"
qmgr -c "set queue dqueue started = True"

qmgr -c "set server scheduling = True"
qmgr -c "set server acl_hosts = headnode.com"
qmgr -c "set server default_queue = dqueue"
qmgr -c "set server log_events = 127"
qmgr -c "set server mail_from = Cluster_Admin"
qmgr -c "set server query_other_jobs = True"
qmgr -c "set server resources_default.walltime = 240:00:00"
qmgr -c "set server resources_max.walltime = 720:00:00"
qmgr -c "set server scheduler_iteration = 60"
qmgr -c "set server node_check_rate = 150"
qmgr -c "set server tcp_timeout = 6"
qmgr -c "set server node_pack = False"
qmgr -c "set server mom_job_sync = True"
qmgr -c "set server keep_completed = 300"
qmgr -c "set server submit_hosts = headnode1.com"
qmgr -c "set server submit_hosts += headnode2.com"
qmgr -c "set server allow_node_submit = True"
qmgr -c "set server auto_node_np = True"
qmgr -c "set server next_job_number = 21293"
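After running the commands above, the active settings can be reviewed with qmgr's print subcommand (dqueue here matches the queue created above):

```
qmgr -c "print server"
qmgr -c "print queue dqueue"
```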

Quick method for estimating walltime for Torque Resource Manager

For Torque / OpenPBS or any other scheduler, walltime is an important parameter that lets the scheduler estimate how long a job will take. You can get a quick rough estimate by using the time command:

# time -p mpirun -np 16 --host node1,node2 hello_world_mpi
real 4.31
user 0.04
sys 0.01

The "real" line is the elapsed wall-clock time (here about 4.31 seconds). Since this is only a rough estimate, you should pad the walltime generously:

$ qsub -l walltime=5:00 -l nodes=1:ppn=8 openmpi.sh -v file=hello_world
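A rough way to turn the measured "real" seconds into a padded HH:MM:SS walltime string for qsub (the 1.5x factor and one minute of slack are arbitrary choices, not Torque conventions):

```shell
real_secs=4.31                          # "real" line from: time -p mpirun ...
walltime=$(awk -v s="$real_secs" 'BEGIN {
    t = int(s * 1.5) + 60               # 50% headroom plus a minute of slack
    printf "%02d:%02d:%02d", t / 3600, (t % 3600) / 60, t % 60
}')
echo "qsub -l walltime=$walltime ..."
```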

PBS (Portable Batch System) Commands on Torque

There are some PBS commands that you can use in your customised PBS templates and scripts.

Note:

# Remarks:
#  A line beginning with # is a comment;
#  A line beginning with #PBS is a PBS directive;
#  Directives are case sensitive.

Job Name (Default)

#PBS -N jobname

Specifies the number of nodes (nodes=N) and the number of processors per node (ppn=M) that the job should use

#PBS -l nodes=2:ppn=8

Specifies the maximum amount of physical memory used by any process in the job.

#PBS -l pmem=4gb

Specifies maximum walltime (real time, not CPU time)

#PBS -l walltime=24:00:00

Queue Name (If default is used, there is no need to specify)

#PBS -q fastqueue

Group account (for example, g12345) to be charged

#PBS -W group_list=g12345

Put both normal output and error output into the same output file.

#PBS -j oe

Send me an email when the job begins, ends, or aborts

#PBS -m bea
#PBS -M mymail@mydomain.com

Export all my environment variables to the job

#PBS -V

Rerun this job if it fails

#PBS -r y
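Putting the directives above together, a minimal skeleton might look like this (job name, email, and resource values are placeholders; the ${PBS_O_WORKDIR:-.} fallback lets the script be tested outside PBS):

```shell
#!/bin/bash
#PBS -N demo-job
#PBS -l nodes=1:ppn=8
#PBS -l walltime=24:00:00
#PBS -j oe
#PBS -m bea
#PBS -M mymail@mydomain.com
#PBS -V

# Torque sets PBS_O_WORKDIR inside a job; fall back to . elsewhere
cd "${PBS_O_WORKDIR:-.}"
msg="job started"
echo "$msg"
```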

Predefined Environmental Variables for OpenPBS qsub

The following environment variables reflect the environment where the user ran qsub:

  1. PBS_O_HOST: The host where you ran the qsub command
  2. PBS_O_LOGNAME: Your user ID where you ran qsub
  3. PBS_O_HOME: Your home directory where you ran qsub
  4. PBS_O_WORKDIR: The working directory where you ran qsub

The following reflect the environment where the job is executing

  1. PBS_ENVIRONMENT: Set to PBS_BATCH to indicate the job is a batch job, or to PBS_INTERACTIVE to indicate the job is a PBS interactive job
  2. PBS_O_QUEUE: The original queue you submitted to
  3. PBS_QUEUE: The queue the job is executing from
  4. PBS_JOBNAME: The job's name
  5. PBS_NODEFILE: The name of the file containing the list of nodes assigned to the job
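These variables are only set inside a running job; when testing script logic on the command line, ${var:-default} fallbacks keep things from breaking (the default strings below are arbitrary):

```shell
# Outside PBS these variables are unset, so the fallbacks after :- apply.
env_mode=${PBS_ENVIRONMENT:-NOT_IN_PBS}
queue=${PBS_QUEUE:-none}
echo "environment=$env_mode queue=$queue"
```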

Configuring Submission Node for Torque 2.5

If you are planning to have more nodes from which users can submit jobs, apart from the Head Node of the cluster, you may want to configure a Submission Node. By default, TORQUE only allows one submission node. There are two ways to configure a submission node: one uses RCmd (ruserok) authentication, the other uses the submit_hosts parameter on the Torque server.

Step 1a: Configuring the Submission Node

First and foremost, one of the main prerequisites is that the submission node must be part of the resource pool known to the Torque server. If the node is not yet registered with the Torque server, follow the steps to make the to-be-submission node part of the resource pool as a pbs_mom client. You can check the setup by looking at Installing Torque 2.5 on CentOS 6 with xCAT tool, especially B. Configuring the TORQUE Clients. You may want to follow up with the optional setup Adding and Specifying Compute Resources at Torque to make sure your core counts are correct.

Step 1b: Ensure the exchange keys between submission node and Torque Server

For more information, see Auto SSH Login without Password

Step 1c: Configure the submission node as a non-default queue (Optional)

For more information, see Using Torque to set up a Queue to direct users to a subset of resources

Step 2: Registering the Submission Node in Torque

If you do not wish the submission node to serve as a compute resource, you can put it in a non-default or unique queue that users will not send jobs to.

Once you have configured the to-be-submission node as one of the clients, you now have to configure the Torque server with these commands:

# qmgr -c 'set server submit_hosts = hostname1'
# qmgr -c 'set server allow_node_submit = True'
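To confirm the settings took effect on the Torque server (hostname1 matches the example above):

```
# qmgr -c 'print server' | grep submit
set server submit_hosts = hostname1
set server allow_node_submit = True
```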

Step 3: Putting the Submission Node inside the Torque Server's /etc/hosts.equiv

# vim /etc/hosts.equiv
submission_node.cluster.com

Step 4: Test the configuration

Do a

$ qsub -I -l nodes=1:ppn=8

You should be able to see from the Torque server that the job was submitted via the submission node by doing a qstat -an:

$ qstat -an

Step 5: Mount Maui Information from PBS/MAUI Server

From the MAUI server, export the MAUI configuration and binaries over NFS so they can be mounted on the submission node.

Edit /etc/exports

/opt/maui               Submission-Node1(rw,no_root_squash,async,no_subtree_check) 
/usr/local/maui         Submission-Node1(rw,no_root_squash,async,no_subtree_check)

At the MAUI Server, restart NFS Services

# service nfs restart

At the submission node, make sure the mount points /opt/maui and /usr/local/maui exist.

Add the entries to /etc/fstab, mount the file systems, and restart netfs:

head-node1:/usr/local/maui    /usr/local/maui        nfs      defaults  0 0
head-node1:/opt/maui          /opt/maui              nfs      defaults  0 0

# service netfs restart

Resources:

  1. Torque Server document 1.3.2 Server configuration
  2. Unable to Submit via Torque Submission Node – Socket_Connect Error for Torque 4.2.7
  3. Bad UID for job execution MSG=ruserok failed validating user1 from ServerNode while configuring Submission Node in Torque

Adding and Specifying Compute Resources at Torque

This blog entry is a follow-up to Installing Torque 2.5 on CentOS 6 with xCAT tool. After installing Torque on the Head Node and Compute Nodes, the next thing to do is to configure the Torque server. In this blog entry, I will focus on configuring the compute resources at the Torque server.

Step 1: Adding Nodes to the Torque Server

# qmgr -c "create node node01"

Step 2: Configure Auto-Detect Nodes CPU Detection. Setting auto_node_np to TRUE overwrites the value of np set in $TORQUEHOME/server_priv/nodes

# qmgr -c "set server auto_node_np = True"
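For reference, entries in $TORQUEHOME/server_priv/nodes look like the lines below (np is the per-node processor count that auto_node_np would otherwise overwrite; the node names are examples):

```
node01 np=8
node02 np=8
```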

Step 3: Start the pbs_mom on the compute nodes; the Torque server will detect the nodes automatically

# service pbs_mom start