GPFS NSD Nodes stuck in Arbitrating Mode

One of our GPFS NSD nodes was stuck indefinitely in arbitrating mode. One noticeable symptom was that users were able to log in but unable to run "ls" on their own directories. You can get a quick diagnosis by looking at one of the NSD nodes. For this kind of issue, run mmdiag --waiters first. There are limited articles on this issue.

# mmdiag --waiters 

.....
.....
0x7FB0C0013D10 waiting 27176.264845756 seconds, SharedHashTabFetchHandlerThread: 
on ThCond 0x1C0000F9B78 (0x1C0000F9B78) (TokenCondvar), reason 'wait for SubToken to become stable'
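When the waiter list is long, it can help to keep only the threads that have been blocked for a long time. The following is a minimal sketch, not part of GPFS: `long_waiters` is a hypothetical helper name, the 300-second default is an arbitrary assumption, and the filter assumes each waiter line contains the text `waiting <seconds> seconds` as in the output above.

```shell
# Hypothetical helper: print only waiter lines at or above a wait threshold.
# Reads waiter text on stdin; the threshold in seconds is the first argument
# (the 300-second default is an assumption, not a GPFS recommendation).
long_waiters() {
    awk -v min="${1:-300}" '
        / waiting [0-9.]+ seconds/ {
            # Find the word "waiting" and compare the number that follows it.
            for (i = 1; i < NF; i++)
                if ($i == "waiting" && $(i + 1) + 0 >= min + 0) { print; break }
        }'
}
```

On an NSD node this could be used as `mmdiag --waiters | long_waiters 3600` to show only threads that have waited an hour or more.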

References:

  1. IZ17622: GPFS DEADLOCK WAITING FOR SUBTOKEN TO BECOME STABLE CAUSES HANG
  2. GPFS File System Deadlock

Here is the procedure from the PMR for collecting information to help resolve the issue.

The steps below will gather all the docs you could provide in terms of first time data capture given an unknown problem.   Do these steps for all your performance/hang/unknown GPFS issues WHEN the problem is occurring.  Commands are executed from one node.  Collection of the docs will vary based on the working collective created below.
.
1) Gather waiters and create working collective. It can be good to get multiple looks at what the waiters are and how they have changed, so running the first mmlsnode command (with the -L) several times as you proceed through the steps below might be helpful (especially if the issue is pure performance, no hangs).
.

# mmlsnode -N waiters -L  > /tmp/allwaiters.$(date +"%m%d%H%M%S")
# mmlsnode -N waiters > /tmp/waiters.wcoll
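Since the procedure suggests taking several looks at the waiters, the capture can be repeated in a loop. This is only a sketch under stated assumptions: `capture_repeatedly` is a hypothetical helper, and the count and interval are values you choose yourself. It runs whatever command it is given and writes each run to its own timestamped file.

```shell
# Hypothetical helper: run a command COUNT times, INTERVAL seconds apart,
# saving each run to PREFIX.<timestamp>.<run number>.
# Usage: capture_repeatedly COUNT INTERVAL PREFIX command [args...]
capture_repeatedly() {
    count=$1; interval=$2; prefix=$3
    shift 3
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" > "${prefix}.$(date +%m%d%H%M%S).${i}"
        # Sleep between runs, but not after the last one.
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, `capture_repeatedly 5 60 /tmp/allwaiters mmlsnode -N waiters -L` would take five snapshots one minute apart.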

.
View allwaiters and waiters.wcoll files to verify that these files are not empty.
.
If either (or both) file(s) are empty, this indicates that the issues seen are not GPFS waiting on any of its threads.  Docs to be gathered in this case will vary.  Do not continue with the steps.  Tell the Service person and they will determine the best course of action and what docs will be needed.
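The empty-file check can be scripted so that the remaining steps are not run by mistake. A minimal sketch, assuming the file names used above; `check_nonempty` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if every file given exists and is non-empty.
check_nonempty() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "empty or missing: $f" >&2
            return 1
        fi
    done
}
```

A call such as `check_nonempty /tmp/allwaiters.* /tmp/waiters.wcoll || echo "stop and contact Service"` would flag the case described above.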
.
2) Gather internaldump from all nodes in the working collective
.

# mmdsh -N /tmp/waiters.wcoll "/usr/lpp/mmfs/bin/mmfsadm dump all > /tmp/\$(hostname -s).dumpall.\$(date +"%m%d%H%M%S")"

.
3) Gather kthreads from all nodes in the working collective
.
Depending on various factors, this command can take a long time
to complete.   If not specifically looking for kernel threads, this
step can be skipped. If the command is running, it can be stopped with
Ctrl-C.
.

# mmdsh -N /tmp/waiters.wcoll "/usr/lpp/mmfs/bin/mmfsadm dump kthreads > /tmp/\$(hostname -s).kthreads.\$(date +"%m%d%H%M%S")"

.
4) If this is a performance problem, get 60 seconds mmfs trace from the
nodes in the working collective.
.
If AIX …
.

# mmtracectl --start --aix-trace-buffer-size=64M --trace-file-size=128M -N /tmp/waiters.wcoll ; sleep 60; mmtracectl --stop -N /tmp/waiters.wcoll

.
If Linux ..
.

# mmtracectl --start --trace-file-size=128M -N /tmp/waiters.wcoll ; sleep 60; mmtracectl --stop -N /tmp/waiters.wcoll

.
5) Gather gpfs.snap from same nodes.
.

# gpfs.snap -N /tmp/waiters.wcoll

.
Gather the docs taken. Steps 1) and 5) will be on the local node, in /tmp and /tmp/gpfs.snapOut respectively and steps 2) and 3) will be in /tmp on the nodes represented in the waiters.wcoll file. The gpfs.snap will pick up the trcrpt in /tmp/mmfs

Many times steps 3) and 4) are not needed unless asked for.  If supplied they may or may not be used.  If there are any issues collecting doc, Steps 1), 2) and 5) are the most critical.


Solution:

1) The allwaiters file shows:

nsd1:  0x2AAAACC659F0 waiting 31358.847013000 seconds, GroupProtocolDriverThread: 
on ThCond 0x5572138 (0x5572138) (MsgRecordCondvar), reason 'RPC wait' for ccMsgGroupLeave
nsd1:  0x2AAAACC659F0 waiting 31358.847013000 seconds, GroupProtocolDriverThread: 
on ThCond 0x5572138 (0x5572138) (MsgRecordCondvar), reason 'RPC wait' for ccMsgGroupLeave

2) Looking at the tscomm section to see which node is “pending”:

Output for mmfsadm dump tscomm on nsd1
######################################################################

Pending messages:
msg_id 345326326, service 1.1, msg_type 26 'ccMsgGroupLeave', n_dest 470, n_pending 1
this 0x5571F90, n_xhold 1, cl 0, cbFn 0x0, age 33501 sec
sent by 'GroupProtocolDriverThread' (0x2AAAACC659F0)

.
.
.
dest <c0n3>          status pending   , err 0, reply len 0
c0n3> 10.x.x.x/0, x.y.y.u (nsd2)
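With hundreds of destinations (n_dest 470 above), the pending one is easy to miss by eye. As a rough sketch, the pending destinations can be pulled out of the dump text with awk. `pending_dests` is a hypothetical helper, and it assumes the `dest <node> status pending` line layout shown above.

```shell
# Hypothetical helper: list destination IDs still marked "status pending".
# Reads "mmfsadm dump tscomm" style text on stdin.
pending_dests() {
    awk '$1 == "dest" && /status pending/ {
        gsub(/[<>]/, "", $2)   # "<c0n3>" -> "c0n3"
        print $2
    }' | sort -u
}
```

For example: `mmfsadm dump tscomm | pending_dests`.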

3) Waiters for nsd2 show the following:

nsd2:  0x2AAAAC9F5A50 waiting 193857.401337000 seconds, NSDThread: 
on ThCond 0x2AAAC01CA600 (0x2AAAC01CA600) (VERBSEventWaitCondvar), reason 'waiting for RDMA write DTO completion'
nsd2:  0x2AAAAC9F33D0 waiting 193856.387375000 seconds, NSDThread: 
on ThCond 0x2AAAD806B190 (0x2AAAD806B190) (VERBSEventWaitCondvar), reason 'waiting for RDMA write DTO completion'
nsd2:  0x2AAAAC9F2090 waiting 193857.691998000 seconds, NSDThread: 
on ThCond 0x2AAAD40A0F90 (0x2AAAD40A0F90) (VERBSEventWaitCondvar), reason 'waiting for RDMA write DTO completion'
nsd2:  0x2AAAAC9DC610 waiting 193857.589074000 seconds, NSDThread: 
on ThCond 0x2AAAC81B2DE0 (0x2AAAC81B2DE0) (VERBSEventWaitCondvar), reason 'waiting for RDMA read DTO completion'
nsd2:  0x2AAAAC9D8C50 waiting 193857.406763000 seconds, NSDThread: 
on ThCond 0x2AAAC01FE5E0 (0x2AAAC01FE5E0) (VERBSEventWaitCondvar), reason 'waiting for RDMA write DTO completion'
nsd2:  0x2AAAAC9CDF10 waiting 193857.692074000 seconds, NSDThread: 
on ThCond 0x2AAAD806F120 (0x2AAAD806F120) (VERBSEventWaitCondvar), reason 'waiting for RDMA write DTO completion'
nsd2:  0x2AAAAC9CB890 waiting 193857.686966000 seconds, NSDThread: 
on ThCond 0x2AAABC140880 (0x2AAABC140880) (VERBSEventWaitCondvar), reason 'waiting for RDMA write DTO completion'
nsd2:  0x2AAAAC9C31D0 waiting 193857.412257000 seconds, NSDThread: 
on ThCond 0x2AAAACD83400 (0x2AAAACD83400) (VERBSEventWaitCondvar), reason 'waiting for RDMA write DTO completion'
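To see at a glance what the waiters on a node are blocked on, the wait reasons can be tallied. This is a sketch only: `count_wait_reasons` is a hypothetical helper, and it assumes each waiter entry carries a `reason '...'` field as in the output above.

```shell
# Hypothetical helper: count waiters per wait reason, most frequent first.
# Reads waiter text on stdin and prints "<count> <reason>" lines.
count_wait_reasons() {
    awk -v q="'" '
        match($0, "reason " q "[^" q "]*" q) {
            # Skip the 8 characters of the leading "reason " plus opening quote,
            # and drop the closing quote.
            reason = substr($0, RSTART + 8, RLENGTH - 9)
            count[reason]++
        }
        END { for (r in count) printf "%d %s\n", count[r], r }
    ' | sort -rn
}
```

Run against the allwaiters file from step 1, for example `grep -A1 '^nsd2:' /tmp/allwaiters.* | count_wait_reasons`, this would show that nearly all nsd2 threads are waiting on RDMA completions.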

Do a “mmfsadm dump verbs” from all of the NSD nodes.

# mmfsadm dump verbs

To fix this issue, stop and restart the GPFS daemon on nsd2.

# mmshutdown -N nsd2
# mmstartup -N nsd2

Error when installing xCAT 2.8.2

If you see errors such as those below,

Error: Package: xCAT-2.8.2-snap201307222333.x86_64 (xcat-2-core)
Requires: conserver-xcat
Error: Package: xCAT-2.8.2-snap201307222333.x86_64 (xcat-2-core)
Requires: syslinux-xcat
Error: Package: xCAT-2.8.2-snap201307222333.x86_64 (xcat-2-core)
Requires: elilo-xcat
Error: Package: xCAT-2.8.2-snap201307222333.x86_64 (xcat-2-core)
Requires: ipmitool-xcat >= 1.8.9
Error: Package: 1:xCAT-genesis-scripts-x86_64-2.8.2-snap201307222333.noarch (xcat-2-core)
Requires: xCAT-genesis-base-x86_64
Error: Package: xCAT-2.8.2-snap201307222333.x86_64 (xcat-2-core)
Requires: xnba-undi

Go to http://sourceforge.net/projects/xcat/files/yum/xcat-dep/rh6/x86_64/ and find the RPMs that match the required package names.
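If the error output is long, the list of missing packages can be extracted from it. A small sketch, assuming the errors were saved to a file (the name yum-errors.txt used in the example is hypothetical) and that each dependency appears on a `Requires:` line as above; `missing_deps` is a hypothetical helper.

```shell
# Hypothetical helper: print the unique package names from "Requires:" lines.
# Reads yum error text on stdin.
missing_deps() {
    awk '$1 == "Requires:" { print $2 }' | sort -u
}
```

For example, `missing_deps < yum-errors.txt` prints one package name per line, which you can then look up in the xcat-dep directory.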

 

 

 

Compiling SQLite 3.8.1 on CentOS 5

Taken from Beyond Linux® From Scratch – Version 2013-11-30

SQLite 3.8.1

  1. Download (HTTP): http://sqlite.org/2013/sqlite-autoconf-3080100.tar.gz
  2. Download size: 1.9 MB

SQLite 3.8.1 Documents

  1. Download (HTTP): http://sqlite.org/2013/sqlite-doc-3080100.zip
  2. Download size: 4.1 MB

Compiling SQLite 3.8.1

# ./configure --prefix=/usr/local/sqlite-3.8.1 --disable-static        \
CFLAGS="-g -O2 -DSQLITE_ENABLE_FTS3=1 \
-DSQLITE_ENABLE_COLUMN_METADATA=1     \
-DSQLITE_ENABLE_UNLOCK_NOTIFY=1       \
-DSQLITE_SECURE_DELETE=1"
# make
# make install
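Because SQLite is installed under a non-default prefix, the shell and the dynamic linker will not find it automatically. One way to make it visible, sketched here for the current shell session only; the variable name SQLITE_HOME is an assumption for illustration.

```shell
# Assumes the --prefix used in the configure step above.
# SQLITE_HOME is a hypothetical variable name used only in this sketch.
SQLITE_HOME=/usr/local/sqlite-3.8.1
export PATH="$SQLITE_HOME/bin:$PATH"
# Prepend the library directory, keeping any existing LD_LIBRARY_PATH entries.
export LD_LIBRARY_PATH="$SQLITE_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

The same lines could go into a profile script if the setting should persist across logins.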

Compiling SQLite Documents

# install -v -m755 -d /usr/local/sqlite-3.8.1/share/doc/sqlite-3.8.1
# cp -v -R sqlite-doc-3080100/* /usr/local/sqlite-3.8.1/share/doc/sqlite-3.8.1

For more information,

Command Explanations (taken from Beyond Linux® From Scratch – Version 2013-11-30)

--disable-static: This switch prevents installation of static versions of the libraries.

CFLAGS="-g -O2 -DSQLITE_ENABLE_FTS3=1 -DSQLITE_ENABLE_COLUMN_METADATA=1 -DSQLITE_SECURE_DELETE -DSQLITE_ENABLE_UNLOCK_NOTIFY=1": Applications such as Firefox require secure delete and enable unlock notify to be turned on. The only way to do this is to include them in the CFLAGS. By default, these are set to "-g -O2" so we specify that to preserve those settings. You may, of course, wish to omit the '-g' if you do not wish to create debugging information. For further information on what can be specified see http://www.sqlite.org/compile.html.

Installing HTSeq for Python 2.6 on CentOS 6

Installing HTSeq is very straightforward on CentOS 6. You just need to follow the installation manual. First install the prerequisites:

# yum install python-devel numpy python-matplotlib

Download and untar the HTSeq source files.

# tar -zxvf HTSeq-0.5.4p5.tar.gz
# cd HTSeq-0.5.4p5

Inside the expanded HTSeq directory, build and install to make HTSeq available to all users:

# python setup.py build
# python setup.py install

For more information, do look at

  1. HTSeq Prerequisites and Installation

Installing Torque 4.2.5 on CentOS 6

References:

Do take a look at the Torque Admin Manual

Step 1: Download the Torque Software from Adaptive Computing

Download the Torque tarball from Torque Resource Manager Site

Step 2: Ensure you have the gcc, gcc-c++, openssl-devel, and libxml2-devel packages

# yum install libxml2-devel openssl-devel gcc gcc-c++

Step 3: Configure the Torque Server

./configure \
--prefix=/opt/torque \
--exec-prefix=/opt/torque/x86_64 \
--enable-docs \
--disable-gui \
--with-server-home=/var/spool/torque \
--enable-syslog \
--with-scp \
--disable-rpp \
--disable-spool \
--enable-gcc-warnings \
--with-pam

Step 4: Compile and install Torque

# make -j8
# make install

Step 5: Configure the trqauthd daemon to start automatically at system boot for the PBS Server

# cp contrib/init.d/trqauthd /etc/init.d/
# chkconfig --add trqauthd
# echo /opt/torque/x86_64/lib > /etc/ld.so.conf.d/torque.conf
# ldconfig
# service trqauthd start

Step 6: Copy the pbs_server and pbs_sched daemon for the PBS Server

# cp contrib/init.d/pbs_server /etc/init.d/pbs_server
# cp contrib/init.d/pbs_sched /etc/init.d/pbs_sched

Step 6b: Initialize serverdb by executing the torque.setup script for the PBS Server

# ./torque.setup root

Step 7: Make self-extracting tarballs packages for Client Nodes

# make packages
Building ./torque-package-clients-linux-i686.sh ...
Building ./torque-package-mom-linux-i686.sh ...
Building ./torque-package-server-linux-i686.sh ...
Building ./torque-package-gui-linux-i686.sh ...
Building ./torque-package-devel-linux-i686.sh ...
Done

Step 7b: Run libtool --finish /opt/torque/x86_64/lib

libtool: finish: PATH="/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/share/xcat/tools:/usr/lib64/qt-3.3/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/bin/intel64:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/usr/local/intel/composer_xe_2011_sp1.11.339/mpirt/bin/intel64:/opt/maui/bin:/opt/torque/x86_64/bin:/root/bin:/sbin" ldconfig -n /opt/torque/x86_64/lib
----------------------------------------------------------------------
Libraries have been installed in:
/opt/torque/x86_64/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
- add LIBDIR to the `LD_LIBRARY_PATH' environment variable
during execution
- add LIBDIR to the `LD_RUN_PATH' environment variable
during linking
- use the `-Wl,-rpath -Wl,LIBDIR' linker flag
- have your system administrator add LIBDIR to `/etc/ld.so.conf'

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------

 

Step 8a: Copy and install on the Client Nodes

for i in node01 node02 node03 node04 ; do scp torque-package-mom-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do scp torque-package-clients-linux-i686.sh ${i}:/tmp/. ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-mom-linux-i686.sh --install ; done
for i in node01 node02 node03 node04 ; do ssh ${i} /tmp/torque-package-clients-linux-i686.sh --install ; done

Step 8b: Alternatively, you can use xCAT to push and run the packages from the PBS Server to the Client Nodes (assuming you have installed xCAT on the PBS Server)

# pscp torque-package-mom-linux-i686.sh compute_noderange:/tmp
# pscp torque-package-clients-linux-i686.sh compute_noderange:/tmp
# psh compute_noderange /tmp/torque-package-mom-linux-i686.sh --install
# psh compute_noderange /tmp/torque-package-clients-linux-i686.sh --install

Step 9: Enabling Torque as a service for the Client Node

# cp contrib/init.d/pbs_mom /etc/init.d/pbs_mom
# chkconfig --add pbs_mom

Step 10a: Start the Services for each of the client nodes

# service pbs_mom start

Step 10b: Alternatively, use xCAT to start the service on all the Client Nodes

# psh compute_noderange "/sbin/service pbs_mom start"

Adding and Specifying Compute Resources at Torque

This blog entry is a follow-up to Installing Torque 2.5 on CentOS 6 with xCAT tool.

After installing Torque on the Head Node and Compute Nodes, the next thing to do is to configure the Torque Server. In this blog entry, I will focus on configuring the compute resources on the Torque Server.

Step 1: Adding Nodes to the Torque Server

# qmgr -c "create node node01"
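With more than a handful of nodes, the `create node` commands can be generated instead of typed one by one. A minimal sketch: `gen_create_nodes` is a hypothetical helper, and the node names are placeholders. It only prints the commands so they can be reviewed before they are run.

```shell
# Hypothetical helper: print one qmgr "create node" command per node name.
# Nothing is executed; review the output before piping it to a shell.
gen_create_nodes() {
    for n in "$@"; do
        printf 'qmgr -c "create node %s"\n' "$n"
    done
}
```

For example, `gen_create_nodes node01 node02 node03 node04` prints four commands; appending `| sh` on the Torque server would execute them.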

Step 2: Configure automatic detection of each node's CPU count. Setting auto_node_np to TRUE overwrites the value of np set in $TORQUEHOME/server_priv/nodes

# qmgr -c "set server auto_node_np = True"

Step 3: Start pbs_mom on the compute nodes. The Torque server will detect the nodes automatically

# service pbs_mom start

Using Eclipse IDE with Intel C++ Compilers on CentOS

This article is taken from Intel C++ Compiler with the Eclipse IDE on Linux

Introduction
Intel C++ Compilers for Linux can be used together with the Eclipse IDE to create C/C++ applications. Via an Intel C++ Eclipse extension the compiler is integrated using the well-known Eclipse C/C++ Development Tooling (CDT) plug-in. Hence all existing features of CDT, like different views, wizards, a powerful editor, and debugging, can be easily used with the Intel compiler as well. In the following a “How-to” guide is provided which explains configuration and usage.


Requirements

  1. Eclipse 3.7, 3.8 or 4.2 and above
    [http://www.eclipse.org/downloads/]
  2. CDT 8.0 or later
    [http://www.eclipse.org/cdt/]
  3. Java Runtime Environment (JRE) version 6.0 (also called 1.6) update 11 or later
    [http://www.oracle.com/technetwork/java/javase/downloads/index.html]
  4. Intel® Composer XE 2013 and above (separate or any suite that provides it, like Intel® Parallel Studio XE 2013)
    [http://software.intel.com/en-us/intel-composer-xe]

Note:
In case Eclipse has to be installed first, use the package Eclipse IDE for C/C++ Developers. It already comes with everything needed for C/C++ development. We will use it as reference in the following.

Installing the Integration
The following is a brief overview of how to install the Intel C++ Eclipse extension. For more information, see Learn More below.

  1. Open the Install dialog for plug-ins via menu Help->Install New Software…:
  2. Click on the Add… button and the Add Repository dialog opens:
  3. Click on the Local… button, specify the directory containing the Intel C++ Eclipse extension and confirm. The Intel C++ Eclipse extension can be found in the installation directory of Intel Composer XE, subdirectory eclipse_support/cdt8.0/eclipse.
  4. Back in the Install dialog select the item Intel(R) C++ Compiler XE 13.0 for Linux* OS and continue by pressing the button Next >.
    Optionally you can also install compiler documentation (recommended) and Intel® Debugger support for native & Intel® MIC architecture (provided they are already installed with Intel Composer XE).
  5. In case there are no items listed, ensure that Group items by category is not selected.
  6. The next dialog summarizes all plug-ins to install. Continue via button Next >:
  7. Finally, the license files are displayed. Make sure to read them. Accept and start installation by clicking on button Finish:
  8. You may be faced with a warning about unsigned content. Confirm by clicking on button OK:
  9. After installation is complete, restart Eclipse.

Using Intel C++ Eclipse Extension

  1. Once the Intel C++ Eclipse extension is installed it can be used for all C/C++ projects – new ones as well as existing ones.
  2. When using the extension, make sure to source the compiler scripts before starting Eclipse:
    $ source <composer_xe_path>/bin/compilervars.[sh|csh] [ia32|intel64]
    $ eclipse
  3. This is crucial to locate the compiler installation. See the compiler documentation for more information about the compiler scripts.
  4. If you experience issues with the integration try to set the locale to en_US when starting Eclipse, e.g.:
    $ LANG=en_US eclipse

 

Create New Project

  1. To create a new C/C++ project, use the Eclipse/CDT wizard via File->New->C Project or C++ Project;
  2. By default the flag Show project types and toolchains only if they are supported on the platform is selected. Thus, all toolchains are shown for which there is an existing compiler installation. Select the toolchains for your project – multiple can be selected at once. To use the latest compiler from Intel Composer XE 2013, select version v13.0.0. It is also possible to use older versions in addition as long as there are existing compiler installations.
    When unchecking the flag Show project types and toolchains only if they are supported on the platform, all toolchains are shown, even if no appropriate compiler is installed on the local system. This can be used for environments with distributed build systems where not all nodes have all compilers installed, but only subsets each. Those toolchains can’t be used unless the proper compiler is installed but they will be present and can be configured.

Once a new project is created like this, building, linking, executing and debugging are no different from using CDT with the default toolchain.

For more information, do see Intel C++ Compiler with the Eclipse IDE on Linux

Compiling and Installing Meep-1.2.1 on CentOS 6 and OpenMPI

Meep (or MEEP) is a free finite-difference time-domain (FDTD) simulation software package developed at MIT to model electromagnetic systems, along with the MPB eigenmode package. The latest official version is 1.2 and can be found at the Download Page for Meep.

But there is a Lapack linking problem for 1.2 which is explained in Error when compiling Meep-1.2 on CentOS. It is strongly recommended to use the pre-release Meep 1.2.1 found at http://jdj.mit.edu/~stevenj/meep-1.2.1.tar.gz

Before you compile Meep 1.2.1, you need to first compile the libctl library. Compiling the libctl library is quite straightforward. After downloading,

Step 1: Compiling libctl-3.2.1

# tar -zxvf libctl-3.2.1.tar.gz
# cd libctl-3.2.1
# ./configure --prefix=/usr/local/libctl-3.2.1
# make -j8
# make install

Step 2: Other prerequisites include guile and guile-devel. Make sure you install these two packages, which can be done with

# yum install guile guile-devel

Step 3: Compiling OpenMPI,
Do look at Compiling OpenMPI 1.6.5 with Intel 12.1.5 on CentOS 6

Step 4: Compiling meep-1.2.1

# tar -zxvf meep-1.2.1.tar.gz
# cd meep-1.2.1
# ./configure --prefix=/usr/local/meep-1.2.1 \
--with-libctl=/usr/local/libctl-3.2.1/share/libctl/ \
LDFLAGS=-L/usr/local/libctl-3.2.1/lib \
CPPFLAGS=-I/usr/local/libctl-3.2.1/include \
--with-mpi
# make -j8
# make install

References:

  1. Undefined reference to `dgetrf_’ error when compiling Meep-1.2 on CentOS
  2. crtbegin.o: No such file: No such file or directory Error when compiling Meep-1.2 on CentOS
  3. My opinion: Compiling Meep

Compiling GNU Scientific Library (GSL) gsl-1.16 on CentOS 6

The GNU Scientific Library (GSL) is a numerical library for C and C++ programmers. It is free software under the GNU General Public License.

The library provides a wide range of mathematical routines such as random number generators, special functions and least-squares fitting. There are over 1000 functions in total with an extensive test suite.

Step 1: The current version of GSL is gsl-1.16.tar.gz

Step 2: You may want to use the latest GCC 4.8.1 to compile. For more information on how to compile GCC 4.8.1, see Compiling GNU 4.8.1 on CentOS 6. This compilation will help to fix all the components required for gsl-1.16

Step 3: After unpacking gsl, compile it in a separate build directory

# cd /root/gsl-1.16/
# mkdir build-gsl
# cd build-gsl
# ../configure --prefix=/usr/local/gsl-1.16/
# make 
# make install

Compiling and installing FFTW 3.3.3 with OpenMPI and OpenMP

This Blog entry is an extension of the Compiling and installing FFTW 3.3.3

To compile FFTW 3.3.3 with OpenMPI, make sure you have compiled OpenMPI and set your paths for the Intel compilers and OpenMPI. You may want to get more information from Compiling OpenMPI 1.6.5 with Intel 12.1.5 on CentOS 6

Step 1: Compiling FFTW 3.3.3 (Single Precision) with OpenMPI and OpenMP

After unpacking FFTW 3.3.3, you may want to use the flags

# ./configure CC=icc \
--enable-float --enable-threads --enable-openmp \
--enable-mpi MPICC=mpicc \
LDFLAGS=-L/usr/local/openmpi/intel/lib CPPFLAGS=-I/usr/local/openmpi/intel/include \
--prefix=/usr/local/fftw-3.3.3-single
# make -j8
# make install

Inside /usr/local/fftw-3.3.3-single/lib, you should see at least the files below

libfftw3f.a
libfftw3f_mpi.a
libfftw3f_omp.a
libfftw3f_threads.a
....
....

Step 2: Compiling FFTW 3.3.3 (Double Precision) with OpenMPI and OpenMP

# ./configure CC=icc \
--enable-threads --enable-openmp \
--enable-mpi MPICC=mpicc \
LDFLAGS=-L/usr/local/openmpi/intel/lib CPPFLAGS=-I/usr/local/openmpi/intel/include \
--prefix=/usr/local/fftw-3.3.3-double
# make -j8
# make install

Inside /usr/local/fftw-3.3.3-double/lib, you should see at least the files below

libfftw3.a
libfftw3_mpi.a
libfftw3_omp.a
libfftw3_threads.a
....
....
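The presence of the expected libraries can be checked with a short script after each build. This is a sketch under stated assumptions: `check_fftw_libs` is a hypothetical helper, and it only checks the four static libraries listed above, where the single-precision names carry an "f" after "libfftw3".

```shell
# Hypothetical helper: verify the four static FFTW libraries exist in a directory.
# $1 = library directory
# $2 = precision suffix: "f" for single precision, "" for double precision
check_fftw_libs() {
    libdir=$1
    suffix=$2
    status=0
    for variant in "" _mpi _omp _threads; do
        lib="$libdir/libfftw3${suffix}${variant}.a"
        if [ ! -f "$lib" ]; then
            echo "missing: $lib" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_fftw_libs /usr/local/fftw-3.3.3-single/lib f` checks the single-precision build and `check_fftw_libs /usr/local/fftw-3.3.3-double/lib ""` checks the double-precision one.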