Platform LSF – Submitting and Controlling jobs

I thought I would list out some useful commands for submitting and controlling jobs on an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Submitting and Controlling jobs
bbot Moves a pending job relative to the last job in the queue
bchkpnt Checkpoints a checkpointable job
bkill Sends a signal to a job
bmig Migrates a checkpointable or rerunnable job
bmod Modifies job submission options
brequeue Kills and requeues a job
bresize Releases slots and cancels pending job resize allocation requests
brestart Restarts a checkpointed job
bresume Resumes a suspended job
bstop Suspends a job
bsub Submits a job
bswitch Moves unfinished jobs from one queue to another
btop Moves a pending job relative to the first job in the queue
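
As a rough sketch of how these commands fit together (the queue names "normal" and "priority", the executable ./myprog, and the job ID 1234 are placeholders for whatever your cluster uses):

# bsub -q normal -n 4 -o myjob.%J.out ./myprog
# bjobs
# bstop 1234
# bresume 1234
# bswitch priority 1234
# bkill 1234

bsub submits ./myprog as a 4-slot job to the normal queue, writing its output to myjob.<jobID>.out; bjobs shows the job ID; bstop and bresume suspend and resume the job; bswitch moves it to the priority queue; bkill terminates it.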

References

  1. Platform LSF – Monitoring jobs and tasks
  2. Platform LSF – Administration and Accounting Commands
  3. Platform LSF – View Information about Cluster
  4. Platform LSF – Submitting and Controlling jobs

Enabling SRIOV in BIOS for IBM Servers and Blade Servers

Step 1: Power on the system, and press F1 to enter the Setup utility.

Step 2: Select System Settings and then Network.

Step 3: Under the Network Device List, select the device to be configured and press Enter to see all the Network Device options (Figure 1).

[Figure 1]

Step 4: Select the device’s description and press Enter to configure the device (Figure 2).

[Figure 2]

Step 5: From the selection menu, select Advanced Mode and press Enter to change the value (Figure 3).

[Figure 3]

Step 6: Choose Enable and press Enter.

Step 7: On the same selection menu, select Controller Configuration and press Enter to enter the configuration menu.

Step 8: Select Configure SRIOV and press Enter.

Step 9: On the Configure SRIOV page, press Enter to toggle the values.

Step 10: Select Enable and press Enter.

Step 11: Select Save Current Configurations and press Enter.

Step 12: Press Esc to exit the menu. Then, click Save to save the configuration.

Step 13: Reboot the system.
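
If the node boots into Linux, one quick way to confirm that the adapter now advertises SR-IOV is to look for the capability in the PCI configuration space (run as root so lspci can read the extended capabilities):

# lspci -vvv | grep -i "single root"

Each adapter that exposes SR-IOV should show a line containing "Single Root I/O Virtualization (SR-IOV)". The virtual functions themselves only appear as additional PCI devices once the driver creates them.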

Platform LSF – Monitoring jobs and tasks

I thought I would list out some useful commands for monitoring jobs and tasks on an LSF cluster. Please read the manual for more in-depth information. Taken from the Platform LSF 8.3 Quick Reference.

Monitoring jobs and tasks
bacct Reports accounting statistics on completed LSF jobs
bapp Displays information about jobs attached to application profiles
bhist Displays historical information about jobs
bjobs Displays information about jobs
bpeek Displays stdout and stderr of unfinished jobs
bsla Displays information about service class configuration for goal-oriented service-level agreement (SLA) scheduling
bstatus Reads or sets external job status messages and data files
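
As a quick illustration (the job ID 1234 and the user name myuser are placeholders):

# bjobs -u all
# bjobs -l 1234
# bpeek 1234
# bhist -l 1234
# bacct -u myuser

bjobs -u all lists jobs from all users, bjobs -l prints the long listing for one job, bpeek shows the stdout/stderr of a job that is still running, bhist -l shows its history, and bacct summarises accounting statistics for myuser's completed jobs.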

References

  1. Platform LSF – Monitoring jobs and tasks
  2. Platform LSF – Administration and Accounting Commands
  3. Platform LSF – View Information about Cluster
  4. Platform LSF – Submitting and Controlling jobs

Adding and Removing the 2nd Mellanox Ethernet Port as an uplink to an Existing Vswitch using the CLI

In the vSphere 5.1 Client, I was able to see the dual-port network adapter (vmnic22.p1, vmnic22.p2) after installing the Mellanox ConnectX® EN 10GbE drivers for VMware® ESX 5.x Server.

But somehow I was not able to use the 2nd port of the Mellanox ConnectX 10G under vSphere Client > Configuration > Networking; it was not visible there. Likewise, under vSphere Client > Configuration > Networking > Add Networking, the 2nd port was not shown as available.

I found a document from Mellanox (MellanoxMLX4_ENDriverforVMwareESXi-5.xREADME) which was useful for resolving the issue. From page 10:

Adding the Device as an uplink to an Existing Vswitch using the CLI

Step 1: Log into the ESXi server with root permission

Step 2: To add an uplink to a vswitch, run:

# esxcli network vswitch standard uplink add -u <uplink_name> -v <vswitch_name>

* uplink_name refers to the name used by ESX for the network adapter. For example, vmnic22.p2 is the uplink name.

Step 3: Check that the uplink was added successfully. Run:

# esxcli network vswitch standard list -v <vswitch_name>
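
As a concrete example, assuming the second Mellanox port is vmnic22.p2 and the target vswitch is the default vSwitch0 (substitute your own vswitch name):

# esxcli network vswitch standard uplink add -u vmnic22.p2 -v vSwitch0
# esxcli network vswitch standard list -v vSwitch0

The Uplinks field of the list output should now include vmnic22.p2.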

Removing the Device as an uplink from an Existing Vswitch using the CLI

Step 1: Log into the ESXi server with root permissions

Step 2: To remove an uplink from a vswitch, run:

# esxcli network vswitch standard uplink remove -u <uplink_name> -v <vswitch_name>
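
With the same assumed names as above, this would be:

# esxcli network vswitch standard uplink remove -u vmnic22.p2 -v vSwitch0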

Helping users to SSH without password into the Compute Nodes manually

Occasionally in a cluster environment, users accidentally delete their head-node SSH keys and later find that they cannot submit jobs to the queue, or that their MPI jobs cannot scale beyond one node. The underlying errors show up when you turn on SSH's verbose mode.

To conduct a quick test,

# ssh -v remote-host

you will see errors similar to those below:

debug1: Unspecified GSS failure.  Minor code may provide more information
Unknown code krb5 195

OR

debug1: Miscellaneous failure
No credentials cache found

To reinstate password-less access to the compute nodes, do the following. First things first, back up the files in your ~/.ssh/ directory.

Step 1: Regenerate the SSH keys

Auto SSH Login without Password
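
The post above covers key generation in detail; in short, one common invocation is the following (accept the default ~/.ssh/id_rsa location and leave the passphrase empty if you want fully password-less logins):

# ssh-keygen -t rsa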

Step 2: Append the public key ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys

# cd ~/.ssh/
# cat id_rsa.pub >> authorized_keys
# chmod 400 /home/myuser/.ssh/authorized_keys
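
If logins still prompt for a password after this, check the surrounding permissions as well: sshd (with its default StrictModes setting) ignores an authorized_keys file kept in a directory writable by other users, and the ssh client refuses to use a private key that others can read:

# chmod 700 ~/.ssh
# chmod 600 ~/.ssh/id_rsa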

Step 3: Try to SSH into the compute nodes. You should now have password-less access to all nodes.
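
A quick way to verify is to run a harmless remote command with verbose output, using one of your actual compute node names in place of node001:

# ssh -v node001 hostname

The node's hostname should be printed without any password prompt, and the verbose output should include a line similar to "Authentication succeeded (publickey)".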