Supported Platform
Year: 2018
Disable FirewallD Services on CentOS 7
Note that the firewall on a CentOS 7 system is enabled by default.
Step 1: To check the status of CentOS 7 FirewallD
# systemctl status firewalld.service
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
Active: inactive (dead)
Docs: man:firewalld(1)
The output above shows that firewalld is already disabled and inactive on this machine; if yours shows active (running), continue with the steps below.
Step 2: To stop the FirewallD
# systemctl stop firewalld.service
Step 3: To completely disable the firewalld service
# systemctl disable firewalld.service
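The two steps above can be wrapped into a small function; a minimal sketch (the firewalld.service unit name is standard on CentOS 7; run as root):

```shell
# Stop firewalld now and keep it from starting at boot.
# Optionally mask the unit so other services cannot re-enable it.
disable_firewalld() {
    systemctl stop firewalld.service &&
    systemctl disable firewalld.service
    # systemctl mask firewalld.service   # uncomment for a harder disable
}

# usage (as root): disable_firewalld
```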
Error while loading shared libraries: libXm.so.4 on CentOS 7
If you are installing something and you see the error “error while loading shared libraries: libXm.so.4”, it is quite easy to solve. Just do the following:
# yum install motif motif-devel
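Before installing, you can confirm which shared libraries a binary actually fails to resolve with ldd; a small helper sketch (the binary path below is hypothetical, substitute your own):

```shell
# List the shared libraries a binary cannot resolve.
missing_libs() {
    ldd "$1" | grep 'not found'
}

# usage: missing_libs /path/to/your/binary
```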
Resolving “lsb_release not found” on CentOS 7
I was installing ABAQUS 2017 on CentOS 7 when I encountered an error. lsb_release prints distribution-specific information. Strangely, the command is missing by default on the CentOS 7 distribution.
[root@node-h001 1]# ./StartGUI.sh
CurrentMediaDir initial="."
CurrentMediaDir="/root/abaqus2017/AM_SIM_Abaqus_Extend.AllOS/1"
Current operating system: "Linux"
./StartGUI.sh[21]: .[31]: .: line 3: lsb_release: not found
DSY_OS_Release=""
Unknown linux release ""
exit 8
Resolution
# yum install redhat-lsb-core
Verification
[root@node-h001 1]# lsb_release
LSB Version:	:core-4.1-amd64:core-4.1-noarch
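In install scripts it can help to guard against the missing command up front; a sketch (same redhat-lsb-core package as above):

```shell
# Install redhat-lsb-core only if lsb_release is not already present.
ensure_lsb_release() {
    if ! command -v lsb_release >/dev/null 2>&1; then
        yum install -y redhat-lsb-core
    fi
}
```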
Pre-check before restarting the NSD Nodes
Before restarting the NSD nodes, quorum manager nodes, or other critical nodes, check the following first to ensure the file system is in good order.
1. Make sure all three quorum nodes are active.
# mmgetstate -N quorumnodes
If any machine is not active, do *not* proceed.
2. Make sure the file system is mounted on the machines
# mmlsmount gpfs0
If the file system is not mounted on some nodes, try to resolve that first before restarting anything.
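The first check can be scripted; a sketch that assumes mmgetstate's tabular output, with the node state in the third column of the numbered rows (verify the column layout against your GPFS version's output before relying on it):

```shell
# Return non-zero if any quorum node is in a state other than "active".
quorum_ok() {
    mmgetstate -N quorumnodes |
        awk '$1 ~ /^[0-9]+$/ && $3 != "active" { bad = 1 } END { exit bad }'
}

# usage: quorum_ok && echo "safe to proceed" || echo "do NOT proceed"
```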
Spectrum Scale User Group @ London (April)
There were good and varied topics discussed at the Spectrum Scale User Group meeting in London:
- Opening & welcome – Simon Thompson, Claire O’Toole, Ted Hoover
- Update Scale (video)/ ESS / Support (video) – Mathias Dietz & Chris Maestas
- MultiCloud Transparent Cloud Tiering (video) – Rob Basham
- Shared NVMe for High Performance Spectrum Scale Clusters (video)- Stuart Campbell
- User Talk – EBI MMAP issues (video – both speakers) – Jordi Valls / Sven Oehme
- GxFS Storage Appliance at Karlsruher Institute of Technology (video) – Jan Erik Sundermann
- Tooling Scale – Automation
- R&S VSA (Virtual storage access) Reliable fault tolerant storage in Broadcast (video) – Oliver Gappa
- Novel TCT: A brief demo on using TCT with alternative cloud gateways – Laurence Horrocks-Barlow
- Ten commandments of good I/O – Rosemary Francis
- Scientific Computing & Storage at The Francis Crick Institute – Michael Holliday
- Mixing storage systems in Spectrum Scale – Migrations & pools stories – Luis Bolinches
- File System Audit Logging / Running Spectrum Scale in a Vagrant environment – Chris Maestas
- AFM Deep Dive – Tuning and debugging – Venkateswara Puvvada
- User Talk – QMUL – Peter Childs
- Sponsor Talk – Lenovo – Michael Hennecke
- User Talk – MAX IV – Andreas Mattsson
- Sponsor Talk – DDN – Vic Cornell
- Cognitive, ML, Hortonworks – Yong ZY Zheng
Basic Tuning of RDMA Parameters for Spectrum Scale
If your cluster shows symptoms of overload and GPFS keeps reporting “overloaded” in its logs like the ones below, you may get long waiters and sometimes deadlocks.
Wed Apr 11 15:53:44.232 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 15:55:24.488 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 15:57:04.743 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 15:58:44.998 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 16:00:25.253 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 16:28:45.601 2018: [I] Sending 'overloaded' status to the entire cluster
Wed Apr 11 16:33:56.817 2018: [N] sdrServ: Received deadlock notification from
Increase scatterBufferSize to a Value that Matches the IB Fabric
One of the first parameters to tune is scatterBufferSize. According to the IBM wiki, FDR10 can be tuned to 131072 and FDR14 to 262144.
The default value of 32768 may perform OK. If the CPU utilization on the NSD IO servers is observed to be high and client IO performance is lower than expected, increasing the value of scatterBufferSize on the clients may improve performance.
# mmchconfig scatterBufferSize=131072
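A sketch combining the change with a verification step (mmfsadm dump config shows the live value; 131072 is the FDR10 figure quoted above, so adjust for your fabric):

```shell
# Apply scatterBufferSize and echo back the value GPFS is actually using.
set_scatter_buffer() {
    mmchconfig scatterBufferSize="$1" &&
    mmfsadm dump config | grep -i scatterBufferSize
}

# usage: set_scatter_buffer 131072
```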
There are other parameters that can be tuned, but scatterBufferSize was the one that worked immediately for me:
verbsRdmaSend
verbsRdmasPerConnection
verbsRdmasPerNode
Disable verbsRdmaSend on the NSD nodes
# mmchconfig verbsRdmaSend=no -N nsd1,nsd2
Verify the settings have taken effect
# mmfsadm dump config | grep verbsRdmasPerNode
Increase verbsRdmasPerNode to 514 for NSD Nodes
# mmchconfig verbsRdmasPerNode=514 -N nsd1,nsd2
Cannot initialize RDMA protocol on Cluster with Platform LSF
If you encounter this issue during an application run and your scheduler is Platform LSF, there is a simple solution.
Symptoms
explicit_dp: Rank 0:13: MPI_Init_thread: didn't find active interface/port
explicit_dp: Rank 0:13: MPI_Init_thread: Can't initialize RDMA device
explicit_dp: Rank 0:13: MPI_Init_thread: Internal Error: Cannot initialize RDMA protocol
MPI Application rank 13 exited before MPI_Init() with status 1
mpirun: Broken pipe
Cause:
In this case the amount of locked memory was set to unlimited in /etc/security/limits.conf, but this was not sufficient: the MPI jobs were started under LSF, and the LSF daemons themselves were started with very small locked-memory limits, which the jobs inherited.
Solution:
Set the amount of locked memory to unlimited in /etc/init.d/lsf by adding the ‘ulimit -l unlimited’ command.
.....
.....
### END INIT INFO
ulimit -l unlimited
. /opt/lsf/conf/profile.lsf
.....
.....
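After restarting the LSF daemons, you can confirm the limit actually applied to the running process by reading /proc/<pid>/limits; a sketch (the daemon name "lim" is LSF's load information manager, so adjust if your process names differ):

```shell
# Show the locked-memory limit of a running process, given its PID.
check_memlock() {
    grep 'Max locked memory' "/proc/$1/limits"
}

# usage: check_memlock "$(pgrep -x lim | head -n 1)"
```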
Disable SElinux in CentOS 7
1. Check the SELinux Status on CentOS 7
# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      28
2. Disable SElinux Temporarily
# setenforce 0
2a. Check Status
# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      28
3. Disable SElinux Permanently
# vim /etc/selinux/config
# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#     enforcing - SELinux security policy is enforced.
#     permissive - SELinux prints warnings instead of enforcing.
#     disabled - No SELinux policy is loaded.
SELINUX=disabled
# SELINUXTYPE= can take one of three values:
#     targeted - Targeted processes are protected,
#     minimum - Modification of targeted policy. Only selected processes are protected.
#     mls - Multi Level Security protection.
SELINUXTYPE=targeted
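The edit can also be done non-interactively; a sketch that rewrites the SELINUX= line in place and keeps a backup of the original (point it at /etc/selinux/config):

```shell
# Set the SELINUX= mode in a config file; keeps a .bak copy of the original.
set_selinux_mode() {    # usage: set_selinux_mode disabled /etc/selinux/config
    sed -i.bak "s/^SELINUX=.*/SELINUX=$1/" "$2"
}
```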
3a. Check Status
# sestatus
SELinux status: disabled
