Basic Installing and Configuring of GPFS Cluster (Part 4)

Step 10: Create an NSD Specification File

At /gpfs_install, create a file named disk.lst

# vim disk.lst

An example of the file using primary and secondary NSD servers is as follows:

/dev/sdb:nsd1-nas,nsd2-nas::::ds4200_b
/dev/sdc:nsd2-nas,nsd1-nas::::ds4200_c

The format is
s1:s2:s3:s4:s5:s6:s7

where
s1 = SCSI device
s2 = NSD server list, separated by commas and arranged in primary, secondary order
s3 = NULL (retained for legacy reasons)
s4 = usage
s5 = failure group
s6 = NSD name
s7 = storage pool name
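
For illustration, here is a hypothetical descriptor with every field filled in (the device, NSD name, failure group and pool values are made up purely for this example; only the field positions matter):

/dev/sdd:nsd1-nas,nsd2-nas::dataAndMetadata:4003:ds4200_d:system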

Step 11: Backup the disk.lst

Back up this specification file, since it is both an input and an output file for mmcrnsd (the command rewrites it).

# cp disk.lst disk.lst.org

Step 12: Create the new NSD specification file

# mmcrnsd -F disk.lst -v no

-F = name of the NSD Specification File
-v = check whether the disk is part of an existing GPFS file system or ever had a GPFS file system on it (if yes, mmcrnsd will not create it as a new NSD; specifying -v no skips this check)

mmcrnsd: Processing disk /dev/sdb
mmcrnsd: Processing disk /dev/sdc
mmcrnsd: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.

Step 13: Verify that the NSDs are properly created.

# mmlsnsd
File system   Disk name    NSD servers
---------------------------------------------------------------------------
gpfs1         ds4200_b     nsd1-nas,nsd2-nas
gpfs1         ds4200_c     nsd2-nas,nsd1-nas
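
Optionally, you can also check how each NSD maps to a local device path on the NSD servers (the output will vary with your configuration):

# mmlsnsd -m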

Step 14: Creating different partitions

If you are creating just a single partition, the above will suffice. If you are creating more than one partition, allocate the appropriate number of LUNs and repeat Steps 10 – 13, using a different specification file name for each partition, such as disk2.lst, disk3.lst, etc.

Step 15: Create the GPFS file system

# mmcrfs /gpfs1 gpfs1 -F disk.lst -A yes -B 1m -v no -n 50 -j scatter

/gpfs1 = a mount point
gpfs1 = device entry in /dev for the file system
-F = output file from the mmcrnsd command
-A = mount the file system automatically every time mmfsd is started
-B = actual block size for this file system; it cannot be larger than the maxblocksize set by the mmchconfig command
-v = check whether this disk is part of an existing GPFS file system or ever had a GPFS file system on it. If yes, mmcrfs will not include this disk in the file system
-n = estimated number of nodes that will mount this file system
-j = block allocation map type (cluster or scatter)

If you have more than one partition, you have to create a file system for each of them:

# mmcrfs /gpfs2 gpfs2 -F disk2.lst -A yes -B 1m -v no -n 50 -j scatter
The following disks of gpfs1 will be formatted on nsd1-nas
.....
.....
Formatting file system
Disks up to size 2.7 TB can be added to
storage pool 'dcs_4200'
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool 'system'
.....
.....
mmcrfs: Propagating the cluster configuration data
to all affected nodes. This is an asynchronous process.

Step 16: Verify GPFS Disk Status

# mmlsdisk gpfs1
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
ds4200_b     nsd         512    4001 yes      yes   ready         up           system
ds4200_c     nsd         512    4002 yes      yes   ready         up           system
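
You can also list the file-system attributes (block size, automount setting, and so on) to confirm that they match what was requested at creation time:

# mmlsfs gpfs1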

Step 17: Mount the file system and check permissions

# mmmount /gpfs1 -a
Fri Sep 11 12:50:17 EST 2012: mmmount:  Mounting file systems ...

Change the permissions for /gpfs1:

# chmod 777 /gpfs1

Step 18: Check and test the file system

Use time with dd to test and analyse read and write performance.
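
A minimal sketch of such a test is shown below; the file name and sizes are arbitrary, and the direct I/O flags are used only to reduce the influence of the client page cache (adjust to suit your environment):

# time dd if=/dev/zero of=/gpfs1/ddtest.bin bs=1M count=10240 oflag=direct
# time dd if=/gpfs1/ddtest.bin of=/dev/null bs=1M iflag=direct
# rm /gpfs1/ddtest.bin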

Step 19: Update the /etc/fstab

LABEL=/                 /                       ext3    defaults        1 1
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0
LABEL=SWAP-sda2         swap                    swap    defaults        0 0
......
/dev/gpfs1           /gpfs_data           gpfs       rw,mtime,atime,dev=gpfs1,noauto 0 0
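
To confirm the file system is mounted where expected across the cluster, you can, for example, run:

# mmlsmount all -L
# df -h | grep gpfs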


Basic Installing and Configuring of GPFS Cluster (Part 3)

Step 8: Start up the GPFS daemon on all the nodes

# mmstartup -a
Fri Aug 31 21:58:56 EST 2010: mmstartup: Starting GPFS ...

Step 9: Ensure the GPFS daemon (mmfsd) is active on all the nodes before proceeding

# mmgetstate -a

Node number  Node name   GPFS state
-----------------------------------
1            nsd1        active
2            nsd2        active
3            node1       active
4            node2       active
5            node3       active
6            node4       active
7            node5       active
8            node6       active
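
If you only want to check a subset of nodes, mmgetstate also accepts a node list (the node names here are from this example cluster):

# mmgetstate -N nsd1,nsd2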


Basic Installing and Configuring of GPFS Cluster (Part 2)

This is a continuation of Basic Installing and Configuring of GPFS Cluster (Part 1).

Step 4b: Verify License Settings (mmlslicense)

# mmlslicense
Summary information
---------------------
Number of nodes defined in the cluster:                         33
Number of nodes with server license designation:                 3
Number of nodes with client license designation:                30
Number of nodes still requiring server license designation:      0
Number of nodes still requiring client license designation:      0

Step 5a: Configure Cluster Settings

# mmchconfig maxMBpS=2000,maxblocksize=4m,pagepool=2000m,autoload=yes,adminMode=allToAll
  • maxMBpS specifies the limit of LAN bandwidth per node. To get the peak rate, set it to approximately 2x the desired bandwidth. For InfiniBand QDR, maxMBpS=6000 is recommended
  • maxblocksize specifies the maximum file-system blocksize. As the typical file size and transaction size are unknown, maxblocksize=4m is recommended
  • pagepool specifies the size of the GPFS cache. If you are using applications that display temporal locality, pagepool > 1G is recommended; otherwise, pagepool=1G is sufficient
  • autoload specifies whether the cluster should automatically load mmfsd when a node is rebooted
  • adminMode specifies whether all nodes allow passwordless root access (allToAll) or whether only a subset of the nodes allow passwordless root access (client).
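
You can also query a single attribute, and mmchconfig supports -i to apply a change immediately as well as permanently for attributes that allow it (pagepool is used here purely as an example value):

# mmlsconfig pagepool
# mmchconfig pagepool=2000m -i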

Step 5b: Verify Cluster Settings

# mmlsconfig
Configuration data for cluster nsd1:
----------------------------------------
myNodeConfigNumber 1
clusterName nsd1-nas
clusterId 130000000000
autoload yes
minReleaseLevel 3.4.0.7
dmapiFileHandleSize 32
maxMBpS 2000
maxblocksize 4m
pagepool 1000m
adminMode allToAll

File systems in cluster nsd1:
---------------------------------
/dev/gpfs1

Step 6: Check the InfiniBand communication method and details using the ibstatus command

# ibstatus
Infiniband device 'mlx4_0' port 1 status:

        default gid:     fe80:0000:0000:0000:0002:c903:0006:d403
        base lid:        0x2
        sm lid:          0x2
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      InfiniBand

Step 7 (if you are using RDMA): Change the GPFS configuration to ensure RDMA is used instead of IP over IB (this can roughly double performance)

# mmchconfig verbsRdma=enable,verbsPorts=mlx4_0/1
mmchconfig: Command successfully completed
mmchconfig: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
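
To check that the settings have been applied, you can query the configuration and, after the daemons are restarted, look for RDMA-related messages in the GPFS log (the log path shown is the usual default; adjust if yours differs):

# mmlsconfig verbsRdma
# grep -i verbs /var/adm/ras/mmfs.log.latest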


Basic Installing and Configuring of GPFS Cluster (Part 1)

This tutorial is a brief write-up of setting up the General Parallel File System (GPFS) with Network Shared Disks (NSDs). For a more detailed and comprehensive treatment, including the underlying principles of the quorum manager, see GPFS: Concepts, Planning, and Installation Guide. This tutorial deals only with the technical setup.

Step 1: Preparation

All nodes to be installed with GPFS should run a supported operating system; for Linux, this means SLES or RHEL.

  1. The nodes should be able to communicate with each other, and password-less SSH should be configured for all nodes in the cluster.
  2. Create an installation directory where you can put all the base and update RPMs, for example /gpfs_install, and copy all the RPMs into it.
  3. Build the portability layer for each node with a different architecture or kernel level. For more information, see Installing GPFS 3.4 Packages. For ease of installation, put all the RPMs in /gpfs_install.

Step 2: Export the path of GPFS commands

Remember to export the PATH so that the GPFS commands in /usr/lpp/mmfs/bin can be found:

# vim ~/.bashrc
export PATH=$PATH:/usr/lpp/mmfs/bin
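
After editing, reload the profile and confirm that the GPFS commands resolve (a quick sanity check):

# source ~/.bashrc
# which mmlscluster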

Step 3: Setup of quorum manager and cluster

Here is a nutshell explanation taken from GPFS: Concepts, Planning, and Installation Guide:

Node quorum is the default quorum algorithm for GPFS™. With node quorum:

  • Quorum is defined as one plus half of the explicitly defined quorum nodes in the GPFS cluster.
  • There are no default quorum nodes; you must specify which nodes have this role.
  • For example, in a configuration with three quorum nodes, GPFS remains active as long as two quorum nodes are available.

Create node_spec.lst at /gpfs_install containing a list of all the nodes in the cluster

# vim node_spec.lst
nsd1:quorum-manager
nsd2:quorum-manager
node1:quorum
node2
node3
node4
node5
node6

Create the GPFS cluster using the node specification file:

# mmcrcluster -n node_spec.lst -p nsd1 -s nsd2 -R /usr/bin/scp -r /usr/bin/ssh
Fri Aug 10 14:40:53 SGT 2012: mmcrcluster: Processing node nsd1-nas
Fri Aug 10 14:40:54 SGT 2012: mmcrcluster: Processing node nsd2-nas
Fri Aug 10 14:40:54 SGT 2012: mmcrcluster: Processing node avocado-h00-nas
mmcrcluster: Command successfully completed
mmcrcluster: Warning: Not all nodes have proper GPFS license designations.
Use the mmchlicense command to designate licenses as needed.
mmcrcluster: Propagating the cluster configuration data to all
affected nodes.  This is an asynchronous process.

-n: list of nodes to be included in the cluster
-p: primary GPFS cluster configuration server node
-s: secondary GPFS cluster configuration server node
-R: remote copy command (e.g., rcp or scp)
-r: remote shell command (e.g., rsh or ssh)

To check whether all nodes were properly added, use the mmlscluster command

# mmlscluster
GPFS cluster information
========================
GPFS cluster name:         nsd1
GPFS cluster id:           1300000000000000000
GPFS UID domain:           nsd1
Remote shell command:      /usr/bin/ssh
Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
Primary server:    nsd1
Secondary server:  nsd2

Node  Daemon node name     IP address       Admin node name     Designation
---------------------------------------------------------------------------
1     nsd1                 192.168.5.60     nsd1-nas            quorum-manager
2     nsd2                 192.168.5.61     nsd2-nas            quorum-manager
3     node1                192.168.5.24     node1               quorum-manager

Step 4a: Set up license files (mmchlicense)

Configure GPFS Server Licensing. Create a license file at /gpfs_install

# vim license_server.lst
nsd1
nsd2
node1
# mmchlicense server --accept -N license_server.lst

The output will be

The following nodes will be designated as possessing GPFS server licenses:
nsd1
nsd2
node1
mmchlicense: Command successfully completed
mmchlicense: Propagating the cluster configuration data to all
affected nodes.  This is an asynchronous process.

Configuring GPFS Client Licensing. Create a file at /gpfs_install

# vim license_client.lst
node2
node3
node4
node5
node6
# mmchlicense client --accept -N license_client.lst

The output will be

The following nodes will be designated as possessing GPFS client licenses:
node2
node3
node4
node5
node6

mmchlicense: Command successfully completed
mmchlicense: Propagating the cluster configuration data to all
affected nodes.  This is an asynchronous process.


Installing GPFS 3.4 Packages

In this work-in-progress tutorial, I will describe how to install the packages and compile the portability layer (gpfs.gplbin) for each kernel level or architecture.

First things first: you may have to use yum to install ksh, rsh, and the other build prerequisites

# yum install ksh rsh compat-libstdc++-33 gcc-c++ imake kernel-devel kernel-headers libstdc++ redhat-lsb

Install the GPFS RPMs on the nodes. Remember to install the base gpfs.base RPM first, before the gpfs.base update RPM

# rpm -ivh gpfs.base-3.4.0-0.x86_64.rpm
# rpm -ivh gpfs.base-3.4.0-12.x86_64.update.rpm
# rpm -ivh gpfs.docs-3.4.0-12.noarch.rpm
# rpm -ivh gpfs.gpl-3.4.0-12.noarch.rpm
# rpm -ivh gpfs.msg.en_US-3.4.0-12.noarch.rpm

Build the portability layer based on your architecture. I’m using CentOS

# cd /usr/lpp/mmfs/src
# make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
# make World
# make InstallImages
# make rpm

The resulting customised package will be placed at /usr/src/redhat/RPMS/x86_64/gpfs.gplbin-2.6.18-164.el5-3.4.0-12.x86_64.rpm

# cd /usr/src/redhat/RPMS/x86_64/
# rpm -ivh gpfs.gplbin-2.6.18-164.el5-3.4.0-12.x86_64.rpm
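
As a quick check that everything is in place, you can list the GPFS packages installed on the node:

# rpm -qa | grep gpfs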

Related information:

  1. Adding nodes to a GPFS cluster

Adding nodes to a GPFS cluster

Assumption:

  1. You have exchanged SSH keys between the GPFS nodes and servers. For more information on key exchange, take a look at Auto SSH Login without Password
  2. You have installed the GPFS packages. See Installing GPFS 3.4 Packages

You must follow these rules when adding nodes to a GPFS cluster:

  • You may issue the command only from a node that already belongs to the GPFS cluster.
  • A node may belong to only one GPFS cluster at a time.
  • The nodes must be available for the command to be successful. If any of the nodes listed are not available when the command is issued, a message listing those nodes is displayed. You must correct the problem on each node and reissue the command to add those nodes.
  • After the nodes are added to the cluster, you must use the mmchlicense command to designate appropriate GPFS licenses to the new nodes.

To add node2 to the GPFS cluster, enter:

# mmaddnode -N node2
The system displays information similar to:
Mon Aug 9 21:53:30 EDT 2004: 6027-1664 mmaddnode: Processing node2
mmaddnode: Command successfully completed
mmaddnode: 6027-1371 Propagating the changes to all affected nodes.
This is an asynchronous process.

To confirm the addition of the nodes, enter:

# mmlscluster

The system displays information similar to:

GPFS cluster information
========================
  GPFS cluster name:         gpfs_cluster
  GPFS cluster id:           680681562214606028
  GPFS UID domain:           gpfs_cluster.com
  Remote shell command:      /usr/bin/rsh
  Remote file copy command:  /usr/bin/rcp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    nsd1
  Secondary server:  nsd2

 Node  Daemon node name        IP address       Admin node name         Designation
--------------------------------------------------------------------------------------
   1   nsd1                    198.117.68.68      nsd1                  quorum
   2   nsd2                    198.117.68.69      nsd2                  quorum
   3   node2                   198.117.68.70      node2

On the GPFS clients, remember to add the path to your .bashrc:

export PATH=$PATH:/usr/lpp/mmfs/bin

Update the GPFS license file; mine is located at /gpfs_install.

# vim /gpfs_install/license_client.lst
node1
node2

Issue the mmchlicense command to set the license designations for the new nodes in the cluster. Make sure you have purchased the licenses from IBM.

# mmchlicense client --accept -N license_client.lst
node1
node2
mmchlicense: Command successfully completed
mmchlicense: Propagating the cluster configuration data to all affected nodes.  This is an asynchronous process.

Use the mmstartup command to start the GPFS daemons on one or more nodes. If you wish to start up only a specific node:

# mmstartup -N node2

You should see /dev/gpfs_data mounted on the client node.
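
To confirm, you can check the daemon state on the new node and then look for the GPFS mount (a quick sanity check; the node name is from this example):

# mmgetstate -N node2
# df -h | grep gpfs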

TCP/IP Optimisation

There are several techniques to optimise TCP/IP. I will mention three types of TCP/IP optimisation:

  1. TCP Offload engines
  2. User Space TCP/IP implementations
  3. Bypass TCP via RDMA

Type 1: TCP Offload

TCP OffLoad Engine (TOE) is a technology that offloads TCP/IP stack processing to the NIC. Used primarily with high-speed interfaces such as 10GbE, the TOE technology frees up memory bandwidth and valuable CPU cycles on the server, delivering the high throughput and low latency needed for HPC applications, while leveraging Ethernet’s ubiquity, scalability, and cost-effectiveness. (Taken from Delivering HPC Applications with Juniper Networks and Chelsio Communications, Juniper Networks, 2010)

TOE is particularly relevant for Ethernet such as 10GbE, where the TCP/IP processing overhead is high because of the larger bandwidth compared with 1GbE.

A good and yet digestible write-up can be found in TCP/IP Offload Engine (TOE). In that article, TCP/IP processing is split into different phases:

  1. Connection establishment
  2. Data transmission/reception
  3. Disconnection
  4. Error handling

Full TCP/IP off-loading