Platform LSF – Submitting and Controlling jobs

I thought I would list out some useful commands for submitting and controlling jobs on an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Submitting and Controlling jobs
bbot Moves a pending job relative to the last job in the queue
bchkpnt Checkpoints a checkpointable job
bkill Sends a signal to a job
bmig Migrates a checkpointable or rerunnable job
bmod Modifies job submission options
brequeue Kills and requeues a job
bresize Releases slots and cancels pending job resize allocation requests
brestart Restarts a checkpointed job
bresume Resumes a suspended job
bstop Suspends a job
bsub Submits a job
bswitch Moves unfinished jobs from one queue to another
btop Moves a pending job relative to the first job in the queue
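
As a rough sketch of how these commands fit together (the queue names "normal" and "priority", the executable ./myprog, and the job ID 1234 are placeholders for whatever your cluster uses):

# bsub -q normal -n 4 -o myjob.%J.out ./myprog
# bjobs
# bstop 1234
# bresume 1234
# bswitch priority 1234
# bkill 1234

bsub submits ./myprog as a 4-slot job to the normal queue, writing its output to myjob.<jobID>.out; bjobs shows the job ID; bstop and bresume suspend and resume the job; bswitch moves it to the priority queue; bkill terminates it.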

References

  1. Platform LSF – Monitoring jobs and tasks
  2. Platform LSF – Administration and Accounting Commands
  3. Platform LSF – View Information about Cluster
  4. Platform LSF – Submitting and Controlling jobs

Enabling SRIOV in BIOS for IBM Servers and Blade Servers

Step 1: Power on the system, and press F1 to enter the Setup utility.

Step 2: Select System Settings and then Network.

Step 3: Under the Network Device List, select the device to be configured and press Enter to see all the Network Device options (Figure 1).

[Figure 1]

Step 4: Select the device’s description and press Enter to configure the device (Figure 2).

[Figure 2]

Step 5: From the selection menu, select Advanced Mode and press Enter to change the value (Figure 3).

[Figure 3]

Step 6: Choose Enable and press Enter.

Step 7: On the same selection menu, select Controller Configuration and press Enter to enter the configuration menu.

Step 8: Select Configure SRIOV and press Enter.

Step 9: On the Configure SRIOV page, press Enter to toggle the values.

Step 10: Select Enable and press Enter.

Step 11: Select Save Current Configurations and press Enter.

Step 12: Press Esc to exit the menu. Then, click Save to save the configuration.

Step 13: Reboot the system.
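
If the node boots into Linux, one quick way to confirm that the adapter now advertises SR-IOV is to look for the capability in the PCI configuration space (run as root so lspci can read the extended capabilities):

# lspci -vvv | grep -i "single root"

Each adapter that exposes SR-IOV should show a line containing "Single Root I/O Virtualization (SR-IOV)". The virtual functions themselves only appear as additional PCI devices once the driver creates them.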

Platform LSF – Monitoring jobs and tasks

I thought I would list out some useful commands for monitoring jobs and tasks on an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Monitoring jobs and tasks
bacct Reports accounting statistics on completed LSF jobs
bapp Displays information about jobs attached to application profiles
bhist Displays historical information about jobs
bjobs Displays information about jobs
bpeek Displays stdout and stderr of unfinished jobs
bsla Displays information about service class configuration for goal-oriented service-level agreement (SLA) scheduling
bstatus Reads or sets external job status messages and data files
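
As a quick illustration (the job ID 1234 and the user name myuser are placeholders):

# bjobs -u all
# bjobs -l 1234
# bpeek 1234
# bhist -l 1234
# bacct -u myuser

bjobs -u all lists jobs from all users, bjobs -l prints the long listing for one job, bpeek shows the stdout/stderr of a job that is still running, bhist -l shows its history, and bacct summarises accounting statistics for myuser's completed jobs.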

References

  1. Platform LSF – Monitoring jobs and tasks
  2. Platform LSF – Administration and Accounting Commands
  3. Platform LSF – View Information about Cluster
  4. Platform LSF – Submitting and Controlling jobs

Adding and Removing the 2nd Mellanox Ethernet Port as an uplink to an Existing Vswitch using the CLI

In the vSphere 5.1 Client, I was able to see the dual-port network adapter (vmnic22.p1, vmnic22.p2) after installing the Mellanox ConnectX® EN 10GbE drivers for VMware® ESX 5.x Server.

But somehow I was not able to use the 2nd port of the Mellanox ConnectX 10G under vSphere Client > Configuration > Networking; it was not visible there. Likewise, under vSphere Client > Configuration > Networking > Add Networking, the 2nd port was not shown as available.

I found a document from Mellanox (MellanoxMLX4_ENDriverforVMwareESXi-5.xREADME) which was useful for resolving the issue. From page 10:

Adding the Device as an uplink to an Existing Vswitch using the CLI

Step 1: Log into the ESXi server with root permission

Step 2: To add an uplink to a vswitch, run:

# esxcli network vswitch standard uplink add -u <uplink_name> -v <vswitch_name>

* uplink_name refers to the name used by ESX for the network adapter. For example, vmnic22.p2 is the uplink name.

Step 3: Check that the uplink was added successfully. Run:

# esxcli network vswitch standard list -v <vswitch_name>
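
As a concrete example, assuming the second Mellanox port is vmnic22.p2 and the target vswitch is the default vSwitch0 (substitute your own vswitch name):

# esxcli network vswitch standard uplink add -u vmnic22.p2 -v vSwitch0
# esxcli network vswitch standard list -v vSwitch0

The Uplinks field of the list output should now include vmnic22.p2.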

Removing the Device as an uplink from an Existing Vswitch using the CLI

Step 1: Log into the ESXi server with root permissions

Step 2: To remove an uplink from a vswitch, run:

# esxcli network vswitch standard uplink remove -u <uplink_name> -v <vswitch_name>
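
With the same assumed names as above, this would be:

# esxcli network vswitch standard uplink remove -u vmnic22.p2 -v vSwitch0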

Helping users to SSH without password into the Compute Nodes manually

Occasionally in a cluster environment, users accidentally delete their head-node SSH keys and later find that they cannot submit jobs to the queue, or that their MPI jobs cannot scale beyond one node. The underlying errors show up when you turn on SSH's verbose mode.

To conduct a quick test,

# ssh -v remote-host

you will see errors similar to those below:

debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195

OR

debug1: Miscellaneous failure
No credentials cache found

To reinstate password-less access to the compute nodes, do the following. First things first, back up the files in your ~/.ssh/ directory.

Step 1: Regenerate the SSH keys

Auto SSH Login without Password
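
The post above covers key generation in detail; in short, one common invocation is the following (accept the default ~/.ssh/id_rsa location and leave the passphrase empty if you want fully password-less logins):

# ssh-keygen -t rsa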

Step 2: Append the public key ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys

# cd ~/.ssh/
# cat id_rsa.pub >> authorized_keys
# chmod 400 /home/myuser/.ssh/authorized_keys
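
If logins still prompt for a password after this, check the surrounding permissions as well: sshd (with its default StrictModes setting) ignores an authorized_keys file kept in a directory writable by other users, and the ssh client refuses to use a private key that others can read:

# chmod 700 ~/.ssh
# chmod 600 ~/.ssh/id_rsa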

Step 3: Try to SSH into the compute nodes. You should now have password-less access to all nodes.
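
A quick way to verify is to run a harmless remote command with verbose output, using one of your actual compute node names in place of node001:

# ssh -v node001 hostname

The node's hostname should be printed without any password prompt, and the verbose output should include a line similar to "Authentication succeeded (publickey)".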