Adding and Specifying Compute Resources in Torque

This blog entry is a follow-up to Installing Torque 2.5 on CentOS 6 with the xCAT tool.

After installing Torque on the Head Node and Compute Nodes, the next thing to do is to configure the Torque Server. In this blog entry, I will focus on configuring the compute resources on the Torque Server.

Step 1: Adding Nodes to the Torque Server

# qmgr -c "create node node01"

Step 2: Configure automatic detection of each node's CPUs. Setting auto_node_np to TRUE overrides the value of np set in $TORQUEHOME/server_priv/nodes

# qmgr -c "set server auto_node_np = True"

Step 3: Start pbs_mom on the compute nodes; the Torque server will then detect the nodes automatically

# service pbs_mom start
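Since auto_node_np overrides the np values in $TORQUEHOME/server_priv/nodes, it may help to see what that file looks like. A minimal sketch, with hypothetical node names and core counts:

```
# $TORQUEHOME/server_priv/nodes
node01 np=8
node02 np=8
```

After editing this file manually, restart pbs_server for the changes to take effect.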

Tracking Batch Jobs in Platform LSF

The content of this article is taken from http://users.cs.fiu.edu/~tho01/psg/3rdParty/lsf4_userGuide/07-tracking.html

1. Displaying the status of all jobs

# bjobs -u all

2. Reporting reasons why a job is pending

# bjobs -p

3. Reporting pending reasons with host names for each condition

# bjobs -lp

4. Detailed report on a specific job

# bjobs -l 6653

5. Reasons why a job is suspended

# bjobs -s

6. Displaying the output of an unfinished job

# bpeek 12345

7. Killing a job

# bkill 12345

8. Suspending a job

# bstop 12345

9. Resuming a job

# bresume 12345
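The commands above combine well with standard text tools. As a sketch, here is how you might extract job IDs from bjobs-style output before feeding them to bkill or bstop; the sample output, user and job IDs below are made up for illustration:

```shell
# Illustrative stand-in for `bjobs -u all` output (header row plus jobs);
# on a real cluster you would pipe the live command output instead
bjobs_output='JOBID   USER    STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
6653    alice   PEND  normal  hostA                    sleep60   Mar 2
6654    alice   RUN   normal  hostA      hostB         sleep60   Mar 2'

# Skip the header row and keep only the first column (the job ID)
ids=$(printf '%s\n' "$bjobs_output" | awk 'NR > 1 { print $1 }')
echo "$ids"
# e.g. printf '%s\n' "$ids" | xargs bkill
```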

Platform LSF – Submitting and Controlling jobs

I thought I would list out some useful commands for submitting and controlling jobs in an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Submitting and Controlling jobs
bbot Moves a pending job relative to the last job in the queue
bchkpnt Checkpoints a checkpointable job
bkill Sends a signal to a job
bmig Migrates a checkpointable or rerunnable job
bmod Modifies job submission options
brequeue Kills and requeues a job
bresize Releases slots and cancels pending job resize allocation requests
brestart Restarts a checkpointed job
bresume Resumes a suspended job
bstop Suspends a job
bsub Submits a job
bswitch Moves unfinished jobs from one queue to another
btop Moves a pending job relative to the first job in the queue

Platform LSF – Monitoring jobs and tasks

I thought I would list out some useful commands for monitoring jobs and tasks in an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Monitoring jobs and tasks
bacct Reports accounting statistics on completed LSF jobs
bapp Displays information about jobs attached to application profiles
bhist Displays historical information about jobs
bjobs Displays information about jobs
bpeek Displays stdout and stderr of unfinished jobs
bsla Displays information about service class configuration for goal-oriented service-level agreement (SLA) scheduling
bstatus Reads or sets external job status messages and data files

Platform LSF – Administration and Accounting Commands

I thought I would list out some useful commands for administering an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Administration and Accounting commands
lsadmin LSF administrative tool to control the operation of the LIM and RES daemons in an LSF cluster; lsadmin help shows all subcommands
lsfinstall Installs LSF using the install.config input file
lsfrestart Restarts the LSF daemons on all hosts in the local cluster
lsfshutdown Shuts down the LSF daemons on all hosts in the local cluster
lsfstartup Starts the LSF daemons on all hosts in the local cluster
badmin LSF administrative tool to control the operation of the LSF Batch system, including sbatchd, mbatchd, hosts and queues; badmin help shows all subcommands
bconf Changes LSF configuration in active memory
bladmin Administrative tool to control the operation of the LSF License Scheduler daemon

Platform LSF – View Information about Cluster

I thought I would list out some useful commands for viewing information about an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

View Information about Cluster
bhosts Displays hosts and their static and dynamic resources
bmgroup Displays information about host groups and compute units
blimits Displays information about resource allocation limits of running jobs
bparams Displays information about tunable batch system parameters
bqueues Displays information about batch queues
busers Displays information about users and user groups
lshosts Displays hosts and their static resource information
lsid Displays the current LSF version number, cluster name and master host name
lsinfo Displays load sharing configuration information
lsload Displays dynamic load indices for hosts

Using Moab mdiag -n to show the state of nodes

The mdiag -n command provides detailed information about the state of the nodes that Moab or Maui is currently tracking.

Name                State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features

Node-c00            Idle   8:8    32161:32161       1:1       62862:64158   2.10  linux [NONE] DEF   0.00 000 [rambutan_8:8][queue_8:8][lemo [DEFAULT]    
.....
.....
Node-c03            Busy   0:8    32161:32161       1:1       62735:64158   2.10  linux [NONE] DEF   8.00 001 [rambutan_8:8][queue_0:8][queue [DEFAULT]
.....
.....

The columns I find especially useful are State, Procs (available cores), Swap and Load.

You can further refine the output with grep:

# mdiag -n |grep Busy
# mdiag -n |grep Idle
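Beyond grep, a short pipeline can summarise the node states. A minimal sketch, using made-up sample lines in place of live mdiag -n output:

```shell
# Illustrative stand-in for `mdiag -n` output (node names are hypothetical);
# on a real cluster you would pipe the live command output instead
sample='Node-c00  Idle  8:8
Node-c01  Idle  8:8
Node-c03  Busy  0:8'

idle=$(printf '%s\n' "$sample" | grep -c Idle)   # count lines containing Idle
busy=$(printf '%s\n' "$sample" | grep -c Busy)   # count lines containing Busy
echo "idle=$idle busy=$busy"
```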

Using pam_pbssimpleauth.so to authorise user logins for Torque

For a cluster shared by many users, it is important to prevent errant users from SSHing directly into the compute nodes and thereby bypassing the scheduler. To use the PAM module, compile the Torque Server as described in Installing Torque 2.5 on CentOS 6.

Step 1: You should be able to find the pam_pbssimpleauth module files at

$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.a
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.la
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so

Step 2: Copy pam_pbssimpleauth.so to the compute nodes. Do not put pam_pbssimpleauth.so on the Head Node.

# scp $TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so node1:/lib64/security/

Step 3: Verify that pam_access.so is also present in the /lib64/security/ directory

# ls /lib64/security/pam_access.so

Step 4: Add pam_pbssimpleauth.so and pam_access.so to the PAM configuration file

# vim /etc/pam.d/sshd
auth       required     pam_sepermit.so
auth       include      password-auth
account    required     pam_nologin.so

account    required     pam_pbssimpleauth.so
account    required     pam_access.so

account    include      password-auth
password   include      password-auth
.....
.....

When a user SSHes to a node, this module checks the .JB files in $PBS_SERVER_HOME/mom_priv/jobs/ for a matching uid and verifies that the job is running.

You can then test the configuration by SSHing to a compute node as a user with, and without, a running job on that node.

PBS scripts with mpirun parameters for Chelsio / InfiniBand cards

If you are running Chelsio cards, you may want to specify the mpirun parameters to ensure the correct transports are used and each MPI process is bound to a core.

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun \
    --mca btl openib,sm,self --bind-to-core \
    --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file

--bind-to-core: Bind each MPI process to a core
--mca btl openib,sm,self: Use the openib (InfiniBand/iWARP), sm (shared memory) and self (loopback) transports
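Putting this together, a complete PBS script around that mpirun line might look like the following; the job name, queue name, resource request and the $file variable are assumptions for illustration:

```shell
#!/bin/bash
#PBS -N chelsio_job
#PBS -q clusterqueue
#PBS -l nodes=2:ppn=8

cd $PBS_O_WORKDIR

# Derive the process count from the node file PBS generates for the job
NCPUS=$(wc -l < $PBS_NODEFILE)

/usr/mpi/intel/openmpi-1.4.3/bin/mpirun \
    --mca btl openib,sm,self --bind-to-core \
    --report-bindings -np $NCPUS -machinefile $PBS_NODEFILE $PBS_O_WORKDIR/$file
```

Submit it with qsub, passing the binary name via -v file=mybinaryfile.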

For more information on interprocess communication with shared memory, see Speaking UNIX: Interprocess communication with shared memory.

Sequential execution of parallel or serial jobs on OpenPBS / Torque

If you have a requirement to execute jobs in sequence, where the 2nd job can launch only after the 1st job has completed, you can use the -W option.

For example, suppose you have a running job with job ID 12345, and you want the next job to run only after job 12345 has finished.

$ qsub -q clusterqueue -l nodes=1:ppn=8 -W depend=afterany:12345 parallel.sh -v file=mybinaryfile

You will notice that the second job is held until the first job has finished executing.

.....
.....
24328 kittycool Hold 2 10:00:00:00 Sat Mar 2 02:20:12
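Rather than hard-coding the job ID, you can capture it from qsub, which prints the new job's full ID on stdout. A minimal sketch; the ID and server name below are made up, and in practice you would use jobid=$(qsub first.sh):

```shell
# Illustrative job ID of the form qsub prints (number.servername)
jobid="12345.headnode"

# Build the dependency argument for the second submission
depend_arg="depend=afterany:${jobid}"
echo "$depend_arg"

# Then submit the dependent job:
# qsub -q clusterqueue -W "$depend_arg" second.sh
```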