Checking Priority of Queues for PBS Professional

A nifty one-liner to check queue priorities:

qmgr -c "p q @default" | grep -i Priority
set queue queue1 Priority = 150
set queue queue2 Priority = 200
set queue queue3 Priority = 300
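To change a queue's priority, set the attribute with qmgr; with the scheduler's default by_queue policy, higher-priority queues are examined first. For example, to raise queue1 above queue2:

qmgr -c "set queue queue1 Priority = 250"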

 

Limiting Users on PBS Professional

Scenario 1: How do we restrict users to a maximum job size while also capping the total number of cores each user can run concurrently?

For example, suppose you would like to restrict each user of this queue to a maximum of 4 cores per job, while all of his or her concurrently running jobs together cannot exceed 16 cores.

qmgr -c "set queue workq max_run_res.ncpus = [u:PBS_GENERIC=16]"
qmgr -c "set queue workq resources_max.ncpus = 4"

The first limit caps each user at a total of 16 cores across all of their running jobs in the workq queue.
The second limit caps each individual job in the workq queue at 4 cores.
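A quick way to verify the per-job cap is to submit a job that requests more than 4 cores; PBS should reject it at submission time (the sleep command is just a placeholder workload):

$ qsub -l select=1:ncpus=8 -q workq -- /bin/sleep 100
qsub: Job violates queue and/or server resource limits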

 

Scenario 2: How do we ensure that jobs submitted to the queue request at least a minimum number of cores?

For example, if you would like to restrict users to a minimum of 32 cores per job:

qmgr -c " s q workq resources_min.ncpus=32"

Test:

qsub -l select=1:ncpus=16 -q workq -- /bin/sleep 100
qsub: Job violates queue and/or server resource limits
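Conversely, a request that meets the minimum is accepted (the job ID shown below is illustrative):

$ qsub -l select=1:ncpus=32 -q workq -- /bin/sleep 100
1234.hpc-mn1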

Altair Webinar – End-to-End HPC from Home with Altair Access – Run, Visualize, and Manage Files

Altair Access provides a simple, powerful, and consistent interface for submitting and monitoring jobs on remote clusters, clouds, and other resources, allowing engineers and researchers to focus on core activities and spend less time learning how to run applications and moving data around.

Live Webinar
Thursday, April 23rd
11:00 AM – 12:00 PM SGT | 01:00 PM – 02:00 PM AEST
Click here to Register


Click Here for the Agenda

Who should attend:
HPC engineers, scientists, and administrators who would like to access HPC from anywhere with ease, as well as anyone interested in learning about High Performance Computing.

Allocating More GPU Chunks for a GPU Node in PBS Professional

Check the visualisation node's configuration:

# qmgr -c " p n VizSvr1"

1. In the node configuration in PBS Professional, the GPU chunk count (“ngpus”) is 10.

#
# Create nodes and set their properties.
#
#
# Create and define node VizSvr1
#
create node VizSvr1
set node VizSvr1 state = free
set node VizSvr1 resources_available.allows_container = False
set node VizSvr1 resources_available.arch = linux
set node VizSvr1 resources_available.host = VizSvr1
set node VizSvr1 resources_available.mem = 791887872kb
set node VizSvr1 resources_available.ncpus = 24
set node VizSvr1 resources_available.ngpus = 10
set node VizSvr1 resources_available.vnode = VizSvr1
set node VizSvr1 queue = iworkq
set node VizSvr1 resv_enable = True

2. At the queue level, notice that the maximum GPU chunks (“ngpus”) is 10 and the default CPU chunk (“ncpus”) is 2.

[root@scheduler1 ~]# qmgr
Max open servers: 49
Qmgr: p q iworkq
#
# Create queues and set their attributes.
#
#
# Create and define queue iworkq
#
create queue iworkq
set queue iworkq queue_type = Execution
set queue iworkq Priority = 150
set queue iworkq resources_max.ngpus = 10
set queue iworkq resources_min.ngpus = 1
set queue iworkq resources_default.arch = linux
set queue iworkq resources_default.place = free
set queue iworkq default_chunk.mem = 512mb
set queue iworkq default_chunk.ncpus = 2
set queue iworkq enabled = True
set queue iworkq started = True

2a. Configure at the queue level: increase the maximum GPU chunks so that more users can share the node. Similarly, lower the default CPU chunk to spread the CPU cores among the concurrent sessions.

Qmgr: set queue iworkq resources_max.ngpus = 20
Qmgr: set queue iworkq default_chunk.ncpus = 1
Qmgr: p q iworkq

2b. Configure at the node level: increase the GPU chunks on the node to the same number you used at the queue level. Make sure the two numbers match.

Qmgr: p n VizSvr1
#
# Create nodes and set their properties.
#
#
# Create and define node VizSvr1
#
create node VizSvr1
set node VizSvr1 state = free
set node VizSvr1 resources_available.allows_container = False
set node VizSvr1 resources_available.arch = linux
set node VizSvr1 resources_available.host = VizSvr1
set node VizSvr1 resources_available.mem = 791887872kb
set node VizSvr1 resources_available.ncpus = 24
set node VizSvr1 resources_available.ngpus = 10
set node VizSvr1 resources_available.vnode = VizSvr1
set node VizSvr1 queue = iworkq
set node VizSvr1 resv_enable = True
Qmgr: set node VizSvr1 resources_available.ngpus = 20
Qmgr: q
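With the new limits in place, a user can request a GPU chunk as usual; an illustrative interactive session request would be:

$ qsub -I -q iworkq -l select=1:ngpus=1:ncpus=1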

You can verify by logging in more sessions and testing it:

[root@VizSvr1 ~]# qstat -ans | grep iworkq
94544.VizSvr1 user1 iworkq xterm 268906 1 1 256mb 720:0 R 409:5
116984.VizSvr1 user1 iworkq Abaqus 101260 1 1 256mb 720:0 R 76:38
118478.VizSvr1 user2 iworkq Ansys 236421 1 1 256mb 720:0 R 51:37
118487.VizSvr1 user3 iworkq Ansys 255657 1 1 256mb 720:0 R 49:51
119676.VizSvr1 user4 iworkq Ansys 308767 1 1 256mb 720:0 R 41:40
119862.VizSvr1 user5 iworkq Matlab 429798 1 1 256mb 720:0 R 23:54
120949.VizSvr1 user6 iworkq Ansys 450449 1 1 256mb 720:0 R 21:12
121229.VizSvr1 user7 iworkq xterm 85917 1 1 256mb 720:0 R 03:54
121646.VizSvr1 user8 iworkq xterm 101901 1 1 256mb 720:0 R 01:57
121664.VizSvr1 user9 iworkq xterm 111567 1 1 256mb 720:0 R 00:01
121666.VizSvr1 user9 iworkq xterm 112374 1 1 256mb 720:0 R 00:00
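You can also check how many GPU chunks are currently assigned on the node, since pbsnodes reports both resources_available and resources_assigned (the assigned count below is illustrative):

# pbsnodes VizSvr1 | grep -i ngpus
     resources_available.ngpus = 20
     resources_assigned.ngpus = 11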

Unable to use “-v” variable in PBS Professional 19.2.5

I was not able to use “-v file=test.m” in the latest version of PBS Professional, 19.2.5.

I was using the following command, and qsub did not accept it. It used to work in earlier versions of PBS Professional.

$ qsub gpu.pbs -v file=test.m
usage: qsub [-a date_time] [-A account_string] [-c interval]
[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
[-k keep] [-l resource_list] [-m mail_options] [-M user_list]
[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
[-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value...]
[-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
qsub --version

The solution: by design, the job script has to be the last argument. Change the command accordingly:

$ qsub -v file=test.m gpu.pbs
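Inside the job script, the exported variable is then available as an ordinary environment variable. A minimal sketch of what gpu.pbs might look like, assuming it runs a MATLAB input file (the queue name and resource request here are hypothetical):

#!/bin/bash
#PBS -q gpuq
#PBS -l select=1:ncpus=1:ngpus=1
cd $PBS_O_WORKDIR
# $file was passed in via: qsub -v file=test.m gpu.pbs
matlab -batch "run('$file')"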

Configure PBS not to accept jobs that will run into Scheduled Downtime

Step 1: Go to /pbs/pbs_home/sched_priv and edit the file dedicated_time

# vim /pbs/pbs_home/sched_priv/dedicated_time

Edit the start and end date-times in the given format:

# FORMAT: FROM TO
# ---- --
# MM/DD/YYYY HH:MM MM/DD/YYYY HH:MM
For example:

01/08/2020 08:00 01/08/2020 20:00

Step 2: Reload the PBS scheduler configuration by sending a SIGHUP to the pbs_sched process (find its PID with ps first):

# ps -eaf | grep -i pbs_sched
# kill -HUP 438652
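The two steps can be combined into a single line, assuming pgrep is available on the scheduler host:

# kill -HUP $(pgrep -x pbs_sched)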

Step 3: Submit a job that crosses over the scheduled downtime window; you should see:

$ qstat -asn1
55445.hpc-mn1 user1 q32 MPI2 -- 3 96 -- 120:0 Q -- --Not Running: Job would cross dedicated time boundary
55454.hpc-mn1 user2 q32 MPI -- 1 4 -- 120:0 Q -- --Not Running: Job would cross dedicated time boundary
55455.hpc-mn1 user1 q32 MPI -- 1 4 -- 120:0 Q -- --Not Running: Job would cross dedicated time boundary
.....
.....

Ports used by PBS Analytics

The default http port for the PBSA service is 9000.
The default https port for the PBSA service is 9143.
The default https port for the PBSA data collector is 9343.
The default port for the PBSA MonetDB is 9200.
The default port for the Envision Tomcat-8 server is 9080.
The default https port for Envision is 9443.
The default port for the PBSA MongoDB is 9700.
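If the PBSA host runs firewalld, remember to open these ports. A sketch (trim the list to the services you actually run):

# firewall-cmd --permanent --add-port={9000,9143,9343,9200,9080,9443,9700}/tcp
# firewall-cmd --reload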

Displaying node-level resource summary

P1: To view a node-level resource summary, similar to bhosts in Platform LSF:

# pbsnodes -aSn
n003 job-busy 1 1 0 377gb/377gb 0/32 0/0 0/0 14654
n004 job-busy 1 1 0 377gb/377gb 0/32 0/0 0/0 14661
n005 free 9 9 0 346gb/346gb 21/32 0/0 0/0 14570,14571,14678,14443,14608,14609,14444,14678,14679
n006 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n008 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n009 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n010 job-busy 1 1 0 377gb/377gb 0/32 0/0 0/0 14665
n012 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n013 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n014 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n015 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n007 free 0 0 0 377gb/377gb 32/32 0/0 0/0 --
n016 job-busy 1 1 0 77gb/377gb 0/32 0/0 0/0 14681
n017 job-busy 1 1 0 377gb/377gb 0/32 0/0 0/0 14676
n018 job-busy 1 1 0 377gb/377gb 0/32 0/0 0/0 14677
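To keep a continuously refreshing view of the node summary, assuming watch is installed:

# watch -n 30 pbsnodes -aSn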

P2: To view a job-level summary with the scheduler's explanation via qstat:

# qstat -ans | less
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
40043.hpc-mn1 chunfei0 iworkq Ansys 144867 1 1 256mb 720:0 R 669:1
r001/11
Job run at Mon Oct 21 at 15:30 on (r001:ncpus=1:mem=262144kb:ngpus=1)
40092.hpc-mn1 e190013 iworkq Ansys 155351 1 1 256mb 720:0 R 667:0
r001/13
Job run at Mon Oct 21 at 17:41 on (r001:ncpus=1:mem=262144kb:ngpus=1)
42557.hpc-mn1 i180004 q32 LAMMPS -- 1 48 -- 72:00 Q --
--
Not Running: Insufficient amount of resource: ncpus (R: 48 A: 14 T: 2272)
42941.hpc-mn1 hpcsuppo iworkq Ansys 255754 1 1 256mb 720:0 R 290:2
r001/4
Job run at Wed Nov 06 at 10:18 on (r001:ncpus=1:mem=262144kb:ngpus=1)
....
....
....
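To see the scheduler's comment for one specific job rather than the whole list, query the job's comment attribute (using job 42557 from the output above):

$ qstat -f 42557 | grep -i comment
    comment = Not Running: Insufficient amount of resource: ncpus (R: 48 A: 14 T: 2272)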

Clearing the password cache for Altair Display Manager

If you are using Altair Display Manager and you encounter an error message containing java.util.concurrent.ExecutionException, clear the cached registration as follows.

Resolution Step 1: 

Click the icon at the top left-hand corner of the browser.

Resolution Step 2:

Click the Compute Manager icon.

Resolution Step 3:

At the top-right corner of the browser, click the settings icon and select “Edit/Unregister”.

Resolution Step 4:

At the bottom left-hand corner, click “Unregister”.

Click “Yes”.

Resolution Step 5:

Click “Save”

Log out and log in again.