Deleting Nodes from a GPFS Cluster

Things to take note of before deleting a node:

  1. A node being deleted cannot be the primary or secondary GPFS cluster configuration server unless you intend to delete the entire cluster. Verify this by issuing the mmlscluster command. If a node to be deleted is one of the servers and you intend to keep the cluster, issue the mmchcluster command to assign another node as a configuration server before deleting the node.
  2. A node that is being deleted cannot be designated as an NSD server for any disk in the GPFS cluster, unless you intend to delete the entire cluster. Verify this by issuing the mmlsnsd command. If a node that is to be deleted is an NSD server for one or more disks, move the disks to nodes that will remain in the cluster. Issue the mmchnsd command to assign new NSD servers for those disks (see the example after these notes).
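For example, if the node to be deleted turns out to be the primary configuration server and an NSD server, the roles could be moved to other nodes first. The node and NSD names below are only placeholders; check your own setup with mmlscluster and mmlsnsd, and note that the exact syntax may vary with your GPFS release.

# mmchcluster -p node02
# mmchnsd "gpfs1nsd:node02,node03"

The first command makes node02 the new primary cluster configuration server; the second assigns node02 and node03 as the NSD servers for the disk gpfs1nsd.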

Step 1: Shut down the node before deleting it

On the NSD Node

# mmshutdown -N node01
Wed May  1 01:09:51 SGT 2013: mmshutdown: Starting force unmount of GPFS file systems
Wed May  1 01:09:56 SGT 2013: mmshutdown: Shutting down GPFS daemons
node01:  Shutting down!
node01:  'shutdown' command about to kill process 10682
node01:  Unloading modules from /lib/modules/2.6.32-220.el6.x86_64/extra
node01:  Unloading module mmfs26
node01:  Unloading module mmfslinux
node01:  Unloading module tracedev
Wed May  1 01:10:04 SGT 2013: mmshutdown: Finished
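If you want to be sure the GPFS daemon is really down on the node before deleting it, mmgetstate should report the node as down:

# mmgetstate -N node01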

Step 2: Deleting a Node

# mmdelnode -N node01
Verifying GPFS is stopped on all affected nodes ...
mmdelnode: Command successfully completed
mmdelnode: Propagating the cluster configuration data to all
affected nodes.  This is an asynchronous process.

Step 3: Confirm that the node has been deleted

# mmlscluster

Step 4: If you are deleting the client node permanently, check the current license designations.

# mmlslicense
Summary information
---------------------
Number of nodes defined in the cluster:                         20
Number of nodes with server license designation:                 3
Number of nodes with client license designation:                17
Number of nodes still requiring server license designation:      0
Number of nodes still requiring client license designation:      0

Step 5: Edit the client license node list and remove the deleted node.

# vim /gpfs_install/license_client.lst

Step 6: Apply the updated client license list

# mmchlicense client --accept -N license_client.lst
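To double-check the designations after the change, mmlslicense with the -L flag should list the license type assigned to each individual node:

# mmlslicense -L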

Related Information:

  1. Resolving mmremote: Unknown GPFS execution environment when issuing mmdelnode commands

Enable and Disable Quota Management for GPFS

Taken from GPFS Administration and Programming Reference – Enabling and disabling GPFS quota management

To enable GPFS quota management on an existing GPFS file system (a command sketch follows these steps):

  1. Unmount the file system everywhere.
  2. Run the mmchfs -Q yes command. This command automatically activates quota enforcement whenever the file system is mounted.
  3. Remount the file system, activating the new quota files. All subsequent mounts follow the new quota setting.
  4. Compile inode and disk block statistics using the mmcheckquota command. The values obtained can be used to establish realistic quota values when issuing the mmedquota command.
  5. Issue the mmedquota command to explicitly set quota values for users, groups, or filesets.
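Put together, the sequence above could look something like this for a file system called fs0 (the device name and user name are just examples):

# mmumount fs0 -a
# mmchfs fs0 -Q yes
# mmmount fs0 -a
# mmcheckquota fs0
# mmedquota -u user1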

Once GPFS quota management has been enabled, you may establish quota values by:

  1. Setting default quotas for all new users, groups of users, or filesets (see the example after this list).
  2. Explicitly establishing or changing quotas for users, groups of users, or filesets.
  3. Using the gpfs_quotactl() subroutine.
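For the first two options, the commands involved are mmdefedquota (default quotas) and mmedquota (explicit quotas). A rough sketch, again using fs0 and user1 as placeholders; depending on your GPFS release you may also need to activate default quotas with mmdefquotaon, so check the man pages:

# mmdefedquota -u fs0
# mmedquota -u user1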

To disable quota management:

  1. Unmount the file system everywhere.
  2. Run the mmchfs -Q no command.
  3. Remount the file system, deactivating the quota files. All subsequent mounts obey the new quota setting.

To enable GPFS quota management on a new GPFS file system:

  1. Run the mmcrfs -Q yes command (an example follows these steps). This option automatically activates quota enforcement whenever the file system is mounted.
  2. Mount the file system.
  3. Issue the mmedquota command to explicitly set quota values for users, groups, or filesets. See Explicitly establishing and changing quotas.
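A rough sketch of the above, where the device name, disk descriptor file and mount point are made up for illustration and the exact mmcrfs arguments depend on your GPFS version:

# mmcrfs gpfs1 -F /gpfs_install/nsd.desc -Q yes -T /gpfs1
# mmmount gpfs1 -a
# mmedquota -u user1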

Clearing memory cache

The first thing you need is kernel 2.6.16 or later, which introduced the drop_caches interface.

The commands to clear the memory cache:

# sync
# echo 3 > /proc/sys/vm/drop_caches

sync -> tells the kernel to write any buffered data out to disk first

echo 3 > /proc/sys/vm/drop_caches -> frees pagecache, dentries and inodes

From the article Invalidating the Linux buffer cache:

To free pagecache:    
# echo 1 > /proc/sys/vm/drop_caches

To free dentries and inodes:
# echo 2 > /proc/sys/vm/drop_caches

To free pagecache, dentries and inodes:
# echo 3 > /proc/sys/vm/drop_caches

References:

  1. Invalidating the Linux buffer cache
  2. HOWTO: Clear filesystem memory cache
  3. drop_caches

Using Moab mdiag -n to check the state of nodes

The mdiag -n command provides detailed information about the state of the nodes that Moab or Maui is currently tracking.

Name                State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   Load Res Classes                        Network                        Features

Node-c00            Idle   8:8    32161:32161       1:1       62862:64158   2.10  linux [NONE] DEF   0.00 000 [rambutan_8:8][queue_8:8][lemo [DEFAULT]    
.....
.....
Node-c03            Busy   0:8    32161:32161       1:1       62735:64158   2.10  linux [NONE] DEF   8.00 001 [rambutan_8:8][queue_0:8][queue [DEFAULT]
.....
.....

The columns that I find especially useful are State, Procs (available cores), Swap and Load.

You can further refine the output, for example:

# mdiag -n |grep Busy
# mdiag -n |grep Idle
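Since the columns of interest sit in fixed positions in the node lines, a quick-and-dirty awk one-liner can pull out just the name, state, procs, swap and load. The field numbers are taken from the sample output above, so adjust them if your Moab version prints a different layout:

# mdiag -n | awk '{print $1, $2, $3, $6, $11}'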

Using pam_pbssimpleauth.so to authorise user logins for Torque

For a cluster shared by many users, it is important to prevent errant users from SSHing directly into the compute nodes and bypassing the scheduler. To obtain the PAM module, compile the Torque server as described in Installing Torque 2.5 on CentOS 6.

Step 1: You should be able to find the pam_pbssimpleauth files at

$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.a
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.la
$TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so

Step 2: Copy pam_pbssimpleauth.so to the compute nodes. Do not put pam_pbssimpleauth.so on the head node.

# scp $TORQUE_HOME/tpackages/pam/lib64/security/pam_pbssimpleauth.so node1:/lib64/security/

Step 3: Verify that pam_access.so is also present in the /lib64/security/ directory

# ls /lib64/security/pam_access.so

Step 4: Add pam_access.so and pam_pbssimpleauth.so to the PAM configuration file

# vim /etc/pam.d/sshd
auth       required     pam_sepermit.so
auth       include      password-auth
account    required     pam_nologin.so

account    required     pam_pbssimpleauth.so
account    required     pam_access.so

account    include      password-auth
password   include      password-auth
.....
.....

When a user SSHes to a node, this module checks the .JB files in $PBS_SERVER_HOME/mom_priv/jobs/ for a matching uid and verifies that the corresponding job is running.

You can try the configuration