Ways to speed up your program

This is an interesting writeup on various ways to speed up your application, and it is useful if you are getting into HPC for the first time. The author, Ivica Bogosavljević, suggests the following approaches:

  • Distributing workload to multiple CPU cores
  • Distributing workload to accelerators
  • Usage of vectorization capabilities of your CPU
  • Optimizing for the memory subsystem
  • Optimizing for the CPU’s branch prediction unit



Installing Voltaire QDR Infiniband Drivers for CentOS 5.4

OS Prerequisites 

  1. RedHat EL4
  2. RedHat EL5
  3. SuSE SLES 10
  4. SuSE SLES 11
  5. Cent OS 5

Software Prerequisites 

  1. bash-3.x.x
  2. glibc-2.3.x.x
  3. libgcc-3.4.x-x
  4. libstdc++-3.4.x-x
  5. perl-5.8.x-x
  6. tcl 8.4
  7. tk 8.4.x-x
  8. rpm 4.1.x-x
  9. libgfortran 4.1.x-x

Step 1: Download the Voltaire drivers that match your OS and version.

You can find the link for the Voltaire QDR drivers at Download Voltaire OFED Drivers for CentOS.

Step 2: Unzip and Untar the Voltaire OFED Package

# bunzip2 VoltaireOFED-1.5_3-k2.6.18-164.el5-x86_64.tar.bz
# tar -xvf VoltaireOFED-1.5_3-k2.6.18-164.el5-x86_64.tar

Step 3: Install the Voltaire OFED Package

# cd VoltaireOFED-1.5_3-k2.6.18-164.el5-x86_64
# ./install

Step 3a: Reboot the Server

Step 4: Setup ip-over-ib

# vim /etc/sysconfig/network-scripts/ifcfg-ib0
# Voltaire Infiniband IPoIB
# service openibd start
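
What goes into ifcfg-ib0 depends on your network. As a rough sketch only, a static IPoIB configuration could be generated as below. The IP address, netmask and the IFCFG_PATH variable are placeholders assumed for illustration, not values from the Voltaire documentation; the keys themselves are the usual RHEL/CentOS ifcfg keys.

```shell
#!/bin/bash
# Sketch: write a minimal static IPoIB config for ib0.
# IPADDR and NETMASK are hypothetical placeholders; substitute your own.
# IFCFG_PATH defaults to a scratch file so nothing under /etc is touched by accident.
IFCFG_PATH="${IFCFG_PATH:-${TMPDIR:-/tmp}/ifcfg-ib0.example}"

write_ifcfg_ib0() {
    cat > "$1" <<'EOF'
# Voltaire Infiniband IPoIB
DEVICE=ib0
BOOTPROTO=static
IPADDR=192.168.10.1
NETMASK=255.255.255.0
ONBOOT=yes
EOF
}

write_ifcfg_ib0 "$IFCFG_PATH"
echo "wrote $IFCFG_PATH"
```

Review the generated file, then copy it to /etc/sysconfig/network-scripts/ifcfg-ib0 as root before starting openibd.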

Step 5 (Optional): Disable yum repository.

If you plan to use yum to locally install opensmd from the Voltaire package directory, you can opt to disable the yum repositories first.

# vim /etc/yum.conf

Type the following at /etc/yum.conf


Step 6: Install Subnet Manager (opensmd). This can be found under

# cd  $VoltaireRootDirectory/VoltaireOFED-1.5_3-k2.6.18-164.el5-x86_64/x86_64/2.6.18-164.15.1.el5

Yum install the opensmd packages

# yum localinstall opensm* --nogpgcheck

Start the opensmd service

# service opensmd start

Step 7: Check that the Infiniband is working

# ibstat

You should get “State: Active”:

CA 'mlx4_0'
        CA type: MT26428
        Number of ports: 1
        Firmware version: 2.6.0
        Hardware version: a0
        Node GUID: 0x0008f1476328oaf0
        System image GUID: 0x0008fd6478a5af3
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 40
                Base lid: 2
                LMC: 0
                SM lid: 14
                Capability mask: 0x0251086a
                Port GUID: 0x0008f103467a5af1
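
If you have many nodes to check, the State line can be tested in a script instead of by eye. The sketch below assumes the ibstat output layout shown above; check_ib_ports is a hypothetical helper name, not part of the OFED tools.

```shell
#!/bin/bash
# Sketch: report each port's State from ibstat-style output read on stdin,
# and return non-zero unless at least one port is seen and every port is Active.
# Assumes the "Port N:" / "State: ..." layout shown in the sample above.
check_ib_ports() {
    awk '
        /^[[:space:]]*Port [0-9]+:/ { port = $2; sub(/:$/, "", port) }
        /^[[:space:]]*State:/       { print "port " port ": " $2
                                      seen++
                                      if ($2 != "Active") bad++ }
        END { exit (seen == 0 || bad > 0) ? 1 : 0 }
    '
}

# Typical use on a node with the drivers installed:
#   ibstat | check_ib_ports && echo "all ports active"
```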

Step 8: Test Connectivity

At the Server side,

# ibping -S

Repeat Steps 1 to 7 on the Client. Once done, run the following on the Client:

# ibping -G 0x0008f103467a5af1 (PORT GUID)

You should see a response like this.

Pong from headnode.cluster.com.(none) (Lid 2): time 0.062 ms
Pong from headnode.cluster.com.(none) (Lid 2): time 0.084 ms
Pong from headnode.cluster.com.(none) (Lid 2): time 0.114 ms
Pong from headnode.cluster.com.(none) (Lid 2): time 0.082 ms
Pong from headnode.cluster.com.(none) (Lid 2): time 0.118 ms
Pong from headnode.cluster.com.(none) (Lid 2): time 0.118 ms

Great! You are done.

Tuning NFSD Server Daemon for Performance

Note that the NFS server daemon (nfsd) is an important component in NFS performance tuning. Here are some tips:

  1. Number of instances of the NFSD server daemon. By default, 8 instances of nfsd are started. The author of Optimizing NFS Performance recommends that system administrators use at the very least one daemon per processor, but four to eight per processor may be a better rule of thumb. To change the number of nfsd instances, edit RPCNFSDCOUNT in the NFS startup script (/etc/rc.d/init.d/nfs on RHEL, Fedora or CentOS).
  2. If you want to determine the right number of nfsd instances yourself, look at the detailed NFS statistics that the Linux kernel provides at /proc/net/rpc/nfsd.
  3. A sample of /proc/net/rpc/nfsd:

    rc 0 47750055 170015423
    fh 39 0 0 0 0
    io 376475178 3831903891
    th 8 18573687 48505.610 3718.131 2831.176 0.000 1813.483 1468.532 1399.593 1551.349 0.000 12224.473
    ra 16 122635704 971110 83992 77018 15770 11434 1655 550 882 407 518440
    net 217768755 0 217768891 1072
    rpc 217765688 0 0 0 0
    proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    proc3 22 3 24906977 238795 7255551 10595346 837 124313278 42671631 2419345 5043 5865 0 2399297 5130 2560 1593 48707 133600 34910 3 0 2650721
    proc4 2 0 0

  4. To analyse some of the output parameters, I’ll be drawing most of the information below from an excellent article, “Understanding Linux nfsd statistics”. A brief summary is as follows:
    rc reports the stats for the NFS reply cache. The three numbers are cache hits, cache misses, and “nocache”, which is presumably requests that bypassed the cache.
    io reports the overall I/O counters. The two numbers are bytes read and bytes written.
    th reports nfsd thread utilisation. The first number is the number of nfsd threads configured. The second is the number of times all threads were in use at once. The remaining ten numbers are a histogram: each is the number of seconds for which thread utilisation fell within one 10% band.
    ra reports the read-ahead cache. The first number is the read-ahead cache size. The next ten numbers are the number of times an entry was found in the read-ahead cache < 10%, < 20%, …, < 100% of the way into the cache. The last number on this line is the number of times an entry was not found in the cache.
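
To avoid reading the raw counters by hand, the rc and th lines can be pulled out with a few lines of awk. This is a minimal sketch that follows the field summary above; nfsd_summary is a hypothetical helper name, and the layout of /proc/net/rpc/nfsd may differ on newer kernels.

```shell
#!/bin/bash
# Sketch: summarise the "rc" and "th" lines of an nfsd statistics file.
# The file argument defaults to /proc/net/rpc/nfsd; field meanings follow
# the summary above and are an assumption for other kernel versions.
nfsd_summary() {
    awk '
        $1 == "rc" { printf "rc_hits=%d rc_misses=%d rc_nocache=%d\n", $2, $3, $4 }
        $1 == "th" { printf "threads=%d busy_count=%d\n", $2, $3 }
    ' "${1:-/proc/net/rpc/nfsd}"
}

# Typical use on the NFS server:
#   nfsd_summary
```

If the second th counter keeps growing between two runs under load, that is a hint to try a larger RPCNFSDCOUNT and measure again.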

Tuning NFS Server exports file for performance

As far as I know, these two options in the NFS Server exports file (/etc/exports) matter most for performance:

  1. async: According to Optimizing NFS Performance, the default export behavior for both NFS Version 2 and Version 3 protocols, as used by exportfs, is “asynchronous”. This default permits the server to reply to client requests as soon as it has processed the request and handed it off to the local file system, without waiting for the data to be written to stable storage. This is indicated by the async option in the server’s export list. It yields better performance at the cost of possible data corruption if the server reboots while still holding unwritten data and/or metadata in its caches. This possible data corruption is not detectable at the time of occurrence, since the async option instructs the server to lie to the client, telling the client that all data has indeed been written to stable storage, regardless of the protocol used.
  2. no_subtree_check: For nfs-utils version 1.0.x and above, disable the subtree check to speed up transfers, especially if you are exporting a large directory.

For example, in /etc/exports:

/tmp *(rw,async,no_subtree_check)

For other good materials on the /etc/exports, do check out

Dealing with Overflow of Fragmented Packets

Most of the information in this post can be found at NFS for Clusters and Optimizing NFS Performance.

One way to check for fragmented-packet issues on the NFS Server is to watch the Ip: ReasmFails counter in the file /proc/net/snmp.

# head -2 /proc/net/snmp | cut -d' ' -f17

ReasmFails represents the number of fragment reassembly failures. If ReasmFails goes up too quickly during heavy file activity, the system may be having issues with fragmented packets.
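
The cut command above relies on ReasmFails being the 17th field. As a small sketch that does not depend on the column position, the counter can be looked up by name; ip_counter is a hypothetical helper name, and it assumes the two-line layout of /proc/net/snmp (a line of names followed by a line of values).

```shell
#!/bin/bash
# Sketch: look up an Ip counter in /proc/net/snmp by column name instead of
# by position. The first "Ip:" line holds the names, the second the values.
ip_counter() {
    local name="$1" file="${2:-/proc/net/snmp}"
    awk -v name="$name" '
        $1 == "Ip:" && !have_header { for (i = 2; i <= NF; i++) if ($i == name) col = i
                                      have_header = 1; next }
        $1 == "Ip:" && col          { print $col; exit }
    ' "$file"
}

# Typical use: sample the counter twice during heavy NFS traffic and compare.
#   ip_counter ReasmFails; sleep 10; ip_counter ReasmFails
```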

According to Optimising NFS Performance, if the network topology is too complex, fragment routes may differ, and the fragments may not all arrive at the Server for reassembly. Once the memory used by unprocessed, fragmented packets reaches the value specified by ipfrag_high_thresh (in bytes), the NFS Server kernel will simply start throwing away fragmented packets until the amount held for incomplete packets falls to the value specified by ipfrag_low_thresh.

You can reduce the number of lost packets on the server by increasing the buffer size for fragmented packets.

# echo 524288 > /proc/sys/net/ipv4/ipfrag_low_thresh
# echo 524288 > /proc/sys/net/ipv4/ipfrag_high_thresh

This doubles the defaults.

Configure TCP for faster connections and transfers

On a default Linux box, the TCP settings may not be optimised for higher-bandwidth connections and transfers of 100MB+. Currently, most TCP settings are optimised for 10MB connections. I’m relying on the Linux Tweaking article from SpeedGuide.net to configure TCP.

The TCP parameters to be configured are:

/proc/sys/net/core/rmem_max – Maximum TCP Receive Window
/proc/sys/net/core/wmem_max – Maximum TCP Send Window
/proc/sys/net/ipv4/tcp_timestamps – timestamps (RFC 1323) add 12 bytes to the TCP header
/proc/sys/net/ipv4/tcp_sack – tcp selective acknowledgements.
/proc/sys/net/ipv4/tcp_window_scaling – support for large TCP Windows (RFC 1323). Needs to be set to 1 if the Max TCP Window is over 65535.

There are 2 methods to apply the changes.

Method 1: Write the values directly to the files under /proc/sys/net/. Note that these settings will be lost on reboot.

echo 256960 > /proc/sys/net/core/rmem_default
echo 256960 > /proc/sys/net/core/rmem_max
echo 256960 > /proc/sys/net/core/wmem_default
echo 256960 > /proc/sys/net/core/wmem_max
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
echo 1 > /proc/sys/net/ipv4/tcp_sack
echo 1 > /proc/sys/net/ipv4/tcp_window_scaling

Method 2: For more permanent settings, add the following to /etc/sysctl.conf.

net.core.rmem_default = 256960
net.core.rmem_max = 256960
net.core.wmem_default = 256960
net.core.wmem_max = 256960
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1

Execute sysctl -p to make these new settings take effect.
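
The two methods address the same kernel settings: a key in /etc/sysctl.conf maps to a file under /proc/sys with the dots replaced by slashes. The sketch below illustrates that mapping and prints the current value of each setting from this section; sysctl_key_to_path is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: show how an /etc/sysctl.conf key maps to its /proc/sys file.
# Dots in the key become slashes under /proc/sys.
sysctl_key_to_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print the current value of each setting from this section, if readable.
for key in net.core.rmem_default net.core.rmem_max \
           net.core.wmem_default net.core.wmem_max \
           net.ipv4.tcp_timestamps net.ipv4.tcp_sack net.ipv4.tcp_window_scaling; do
    path=$(sysctl_key_to_path "$key")
    if [ -r "$path" ]; then
        printf '%s = %s\n' "$key" "$(cat "$path")"
    fi
done
```

Running it before and after sysctl -p is a quick way to confirm that the new values are in effect.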

Configuring NFS Server for Performance

How the NFS Server exports the file system plays an important part in the overall performance of NFS. Do also read the NFS Client Recommended Configuration in this same blog.

  1. Tuning NFSD Server Daemon for Performance
  2. Dealing with Overflow of Fragmented Packets
  3. Configure TCP for faster connections and transfers
  4. Turning Off Autonegotiation of NICs and Hubs (optional)
  5. Tuning NFS Server exports file for performance (/etc/exports)
  6. Network Design consideration for NFS

For more in-depth information, see Optimizing NFS Performance.

Configuring NFS Client for Performance

How the NFS Client mounts the file system has some impact on the performance of the NAS boxes. There are some NFS mount options that we can use. I’m assuming we are using NFSv3.

  1. Use the tcp option when possible. UDP performance is better when the network is lightly loaded, but the TCP option is more efficient when the system load is heavy. When using TCP, a single dropped packet can be retransmitted without retransmitting the entire RPC request, resulting in better performance on lossy networks. In addition, TCP handles network speed differences better than UDP, due to the underlying flow control at the network level.
  2. Use the hard option so that the client keeps retrying the NFS operation rather than returning an error to the user application performing the I/O.
  3. rsize and wsize specify the size of the chunks of data that the client and server pass back and forth to each other. If no rsize and wsize options are specified, the default varies by which version of NFS we are using. To maximise read / write throughput, use rsize=32768, wsize=32768.
  4. By default, every time a client reads from a file, the server must update the inode’s time stamp for the most recent access. This incurs a performance penalty. Performance should improve by adding the noatime flag.
  5. For a heavily loaded server, you may want to increase the timeout to 2 seconds (timeo=20, since timeo is given in tenths of a second) to avoid overloading the server.
  6. For more reliability when the server is heavily loaded, set retrans=10 so that the client retries the RPC request 10 times instead of the default 3.
  7. Caching Parameters
    1. acregmin=n.  The  minimum time (in seconds) that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. The default is 3 seconds. No need to tweak the parameter
    2. acregmax=n. The  maximum time (in seconds) that the NFS client caches attributes of a regular file before it requests fresh attribute information from a server. The default is 60. It is recommended to tweak the parameter to 10, i.e. acregmax=10
    3. acdirmin=n. The  minimum  time  (in  seconds) that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. Recommended acdirmin=0
    4. acdirmax=n. The  maximum  time  (in  seconds) that the NFS client caches attributes of a directory before it requests fresh attribute information from a server. Recommended acdirmax=0
  8. Last but not least, there is no one configuration that fits all possible application or file system usage. It takes a lot of tweaking and testing to find the final sweet spot.

Putting it all together, we have:

nas:/home    /home     nfs   hard,intr,tcp,rsize=32768,wsize=32768,noatime,timeo=20,acdirmin=0,acdirmax=0,acregmax=10     0  0


  • intr means that the NFS operation can be interrupted.
  • The first 0 means that the dump program does not need to back up the file system.
  • The second 0 means that the fsck program does not need to check the file system at boot time.
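
A long option string is easy to mistype, so it can help to list the options one per line before committing the entry to /etc/fstab. The sketch below does that for the line above and converts timeo to seconds; fstab_options and timeo_seconds are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: list the mount options (4th field) of an fstab-style line, one per line.
fstab_options() {
    printf '%s\n' "$1" | awk '{ print $4 }' | tr ',' '\n'
}

# Report the timeo option in seconds; NFS expresses timeo in tenths of a second.
timeo_seconds() {
    fstab_options "$1" | awk -F= '$1 == "timeo" { printf "%.1f\n", $2 / 10 }'
}

line='nas:/home /home nfs hard,intr,tcp,rsize=32768,wsize=32768,noatime,timeo=20,acdirmin=0,acdirmax=0,acregmax=10 0 0'
fstab_options "$line"
echo "timeo is $(timeo_seconds "$line") seconds"
```

After mounting, the options the client actually negotiated can be compared against this list by looking at the entry for the mount point in /proc/mounts.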

Much of the information I have written here can be found at:

  1. Optimising NFS Performance (nfs.sourceforge.net)
  2. NFS for Clusters (billharlan.com)
  3. Why are changes made on an NFS share on my Red Hat Enterprise Linux 5 client not immediately visible to other NFS clients? (redhat.com)
  4. Problems with Linux NFS (smorgasbork.com)