Adding and Specifying Compute Resources in Torque

This blog entry is a follow-up to Installing Torque 2.5 on CentOS 6 with the xCAT tool.

After installing Torque on the Head Node and Compute Nodes, the next thing to do is to configure the Torque Server. In this blog entry, I will focus on configuring the compute resources on the Torque Server.

Step 1: Adding Nodes to the Torque Server

# qmgr -c "create node node01"

Step 2: Configure automatic detection of each node's CPUs. Setting auto_node_np to TRUE overrides the value of np set in $TORQUEHOME/server_priv/nodes

# qmgr -c "set server auto_node_np = True"

Step 3: Start pbs_mom on the compute nodes; the Torque server will then detect the nodes automatically

# service pbs_mom start
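Since auto_node_np overrides the np values in $TORQUEHOME/server_priv/nodes, it may help to see what that file looks like. A minimal sketch, with hypothetical node names and core counts:

```
# $TORQUEHOME/server_priv/nodes
node01 np=8
node02 np=8
```

After editing this file manually, restart pbs_server for the changes to take effect.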

Tracking Batch Jobs in Platform LSF

The content of this article is taken from http://users.cs.fiu.edu/~tho01/psg/3rdParty/lsf4_userGuide/07-tracking.html

1. Displaying the status of all jobs

# bjobs -u all

2. Reporting reasons why a job is pending

# bjobs -p

3. Reporting pending reasons with host names for each condition

# bjobs -lp

4. Detailed report on a specific job

# bjobs -l 6653

5. Reasons why a job is suspended

# bjobs -s

6. Displaying the output of an unfinished job

# bpeek 12345

7. Killing a job

# bkill 12345

8. Suspending a job

# bstop 12345

9. Resuming a job

# bresume 12345
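The commands above combine well with standard text tools. As a sketch, here is how you might extract job IDs from bjobs-style output before feeding them to bkill or bstop; the sample output, user and job IDs below are made up for illustration:

```shell
# Illustrative stand-in for `bjobs -u all` output (header row plus jobs);
# on a real cluster you would pipe the live command output instead
bjobs_output='JOBID   USER    STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
6653    alice   PEND  normal  hostA                    sleep60   Mar 2
6654    alice   RUN   normal  hostA      hostB         sleep60   Mar 2'

# Skip the header row and keep only the first column (the job ID)
ids=$(printf '%s\n' "$bjobs_output" | awk 'NR > 1 { print $1 }')
echo "$ids"
# e.g. printf '%s\n' "$ids" | xargs bkill
```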

Platform LSF – Submitting and Controlling jobs

I thought I would list out some useful commands for submitting and controlling jobs in an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Submitting and Controlling jobs
bbot Moves a pending job relative to the last job in the queue
bchkpnt Checkpoints a checkpointable job
bkill Sends a signal to a job
bmig Migrates a checkpointable or rerunnable job
bmod Modifies job submission options
brequeue Kills and requeues a job
bresize Releases slots and cancels pending job resize allocation requests
brestart Restarts a checkpointed job
bresume Resumes a suspended job
bstop Suspends a job
bsub Submits a job
bswitch Moves unfinished jobs from one queue to another
btop Moves a pending job relative to the first job in the queue

Platform LSF – Monitoring jobs and tasks

I thought I would list out some useful commands for monitoring jobs and tasks in an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Monitoring jobs and tasks
bacct Reports accounting statistics on completed LSF jobs
bapp Displays information about jobs attached to application profiles
bhist Displays historical information about jobs
bjobs Displays information about jobs
bpeek Displays stdout and stderr of unfinished jobs
bsla Displays information about service class configuration for goal-oriented service-level agreement (SLA) scheduling
bstatus Reads or sets external job status messages and data files

Platform LSF – Administration and Accounting Commands

I thought I would list out some useful commands for administering an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Administration and Accounting commands
lsadmin LSF administrative tool to control the operation of the LIM and RES daemons in an LSF cluster; lsadmin help shows all subcommands
lsfinstall Installs LSF using the install.config input file
lsfrestart Restarts the LSF daemons on all hosts in the local cluster
lsfshutdown Shuts down the LSF daemons on all hosts in the local cluster
lsfstartup Starts the LSF daemons on all hosts in the local cluster
badmin LSF administrative tool to control the operation of the LSF Batch system, including sbatchd, mbatchd, hosts and queues; badmin help shows all subcommands
bconf Changes LSF configuration in active memory
bladmin Administrative tool to control the operation of the LSF License Scheduler daemon

Platform LSF – View Information about Cluster

I thought I would list out some useful commands for viewing information about an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

View Information about Cluster
bhosts Displays hosts and their static and dynamic resources
bmgroup Displays information about host groups and compute units
blimits Displays information about resource allocation limits of running jobs
bparams Displays information about tunable batch system parameters
bqueues Displays information about batch queues
busers Displays information about users and user groups
lshosts Displays hosts and their static resource information
lsid Displays the current LSF version number, cluster name and master host name
lsinfo Displays load sharing configuration information
lsload Displays dynamic load indices for hosts

Using Moab mdiag -n to show the state of nodes

The mdiag -n command provides detailed information about the state of the nodes that Moab or Maui is currently tracking.

Name                State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features

Node-c00            Idle   8:8    32161:32161       1:1       62862:64158   2.10  linux [NONE] DEF   0.00 000 [rambutan_8:8][queue_8:8][lemo [DEFAULT]    
.....
.....
Node-c03            Busy   0:8    32161:32161       1:1       62735:64158   2.10  linux [NONE] DEF   8.00 001 [rambutan_8:8][queue_0:8][queue [DEFAULT]
.....
.....

The columns I find especially useful are State, Procs (available cores), Swap and Load.

You can further refine the output with grep:

# mdiag -n |grep Busy
# mdiag -n |grep Idle
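Beyond grep, a short pipeline can summarise the node states. A minimal sketch, using made-up sample lines in place of live mdiag -n output:

```shell
# Illustrative stand-in for `mdiag -n` output (node names are hypothetical);
# on a real cluster you would pipe the live command output instead
sample='Node-c00  Idle  8:8
Node-c01  Idle  8:8
Node-c03  Busy  0:8'

idle=$(printf '%s\n' "$sample" | grep -c Idle)   # count lines containing Idle
busy=$(printf '%s\n' "$sample" | grep -c Busy)   # count lines containing Busy
echo "idle=$idle busy=$busy"
```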

Using pam_pbssimpleauth.so to authorise user logins for Torque

For a cluster shared by many users, it is important to prevent errant users from SSHing directly into the compute nodes and thereby bypassing the scheduler. To use the PAM module, compile the Torque Server as described in Installing Torque 2.5 on CentOS 6.

Step 1: You should be able to find the pam_pbssimpleauth module files at

$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.a
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.la
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so

Step 2: Copy pam_pbssimpleauth.so to the compute nodes. Do not put pam_pbssimpleauth.so on the Head Node.

# scp $TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so node1:/lib64/security/

Step 3: Verify that pam_access.so is also present in the /lib64/security/ directory

# ls /lib64/security/pam_access.so

Step 4: Add pam_pbssimpleauth.so and pam_access.so to the PAM configuration file

# vim /etc/pam.d/sshd
auth       required     pam_sepermit.so
auth       include      password-auth
account    required     pam_nologin.so

account    required     pam_pbssimpleauth.so
account    required     pam_access.so

account    include      password-auth
password   include      password-auth
.....
.....

When a user SSHes to a node, this module checks the .JB files in $PBS_SERVER_HOME/mom_priv/jobs/ for a matching uid and verifies that the job is running.

You can then test the configuration by SSHing to a compute node as a user with, and without, a running job on that node.

PBS scripts with mpirun parameters for Chelsio / InfiniBand cards

If you are running Chelsio cards, you may want to specify the mpirun parameters to ensure the correct transports are used and each MPI process is bound to a core.

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun \
    --mca btl openib,sm,self --bind-to-core \
    --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file

--bind-to-core: Bind each MPI process to a core
--mca btl openib,sm,self: Use the openib (InfiniBand/iWARP), sm (shared memory) and self (loopback) transports
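Putting this together, a complete PBS script around that mpirun line might look like the following; the job name, queue name, resource request and the $file variable are assumptions for illustration:

```shell
#!/bin/bash
#PBS -N chelsio_job
#PBS -q clusterqueue
#PBS -l nodes=2:ppn=8

cd $PBS_O_WORKDIR

# Derive the process count from the node file PBS generates for the job
NCPUS=$(wc -l < $PBS_NODEFILE)

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun \
    --mca btl openib,sm,self --bind-to-core \
    --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file
```

Submit it with qsub, passing the binary name via -v file=mybinaryfile.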

For more information on interprocess communication with shared memory, see Speaking UNIX: Interprocess communication with shared memory.

Sequential execution of parallel or serial jobs on OpenPBS / Torque

If you have a requirement to execute jobs in sequence, where the 2nd job can launch only after the 1st job has completed, you can use the -W option.

For example, suppose you have a running job with job ID 12345, and you want the next job to run only after job 12345 has finished.

$ qsub -q clusterqueue -l nodes=1:ppn=8 -W depend=afterany:12345 parallel.sh -v file=mybinaryfile

You will notice that the second job is held until the first job has finished executing.

.....
.....
24328 kittycool Hold 2 10:00:00:00 Sat Mar 2 02:20:12
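Rather than hard-coding the job ID, you can capture it from qsub, which prints the new job's full ID on stdout. A minimal sketch; the ID and server name below are made up, and in practice you would use jobid=$(qsub first.sh):

```shell
# Illustrative job ID of the form qsub prints (number.servername)
jobid="12345.headnode"

# Build the dependency argument for the second submission
depend_arg="depend=afterany:${jobid}"
echo "$depend_arg"

# Then submit the dependent job:
# qsub -q clusterqueue -W "$depend_arg" second.sh
```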