Network File System (NFS) in High Performance Networks (White Papers)

This article, “Network File System (NFS) in High Performance Networks” by Carnegie Mellon, is a very interesting piece on NFS performance. Do take a look. Here is a summary of their findings:

  1. For point-to-point throughput, IP over InfiniBand (Connected Mode) is comparable to native InfiniBand.
  2. When the disk is the bottleneck, NFS benefits from neither IPoIB nor RDMA.
  3. When a disk is not a bottleneck, NFS benefits significantly from both IPoIB and RDMA. RDMA is better than IPoIB by ~20%
  4. As the number of concurrent read operations increases, the aggregate throughput achieved by both IPoIB and RDMA improves significantly, with no disadvantage for IPoIB.

Tweaking the Linux Kernel to manage memory and swap usage

This writeup assumes you are tweaking to minimise swap usage and maximise the use of physical memory. This tweaking should be considered especially for High Performance MPI applications, where good low-latency parallelism between nodes is essential.

In addition, this writeup also helps you to “kill” runaway memory-hogging applications.

1. Preventing Swapping unless absolutely necessary

If you have lots of RAM, you may want to use it for I/O caches and buffers. Keeping data in RAM as caches and buffers is definitely faster than swapping it to disk.

The current value of swappiness can be seen by running the following command:

# cat /proc/sys/vm/swappiness

To modify it, run the following command (0 will prevent swapping unless absolutely required):

# echo 0 > /proc/sys/vm/swappiness

To make the setting permanent, edit /etc/sysctl.conf. Remember to reboot (or run `sysctl -p` to apply the change immediately).
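A minimal sketch of persisting the setting. The file path here is a demo path for illustration only; on a real system you would write to /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/) as root:

```shell
# Persist vm.swappiness = 0 via a sysctl-style config file.
# NOTE: /tmp path is for demonstration only; on a real system use
# /etc/sysctl.conf or /etc/sysctl.d/99-swappiness.conf instead.
CONF=/tmp/99-swappiness.conf

echo "vm.swappiness = 0" > "$CONF"
cat "$CONF"

# On a real system, load the setting without a reboot:
#   sysctl -p /etc/sysctl.d/99-swappiness.conf
```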

2. Memory Management – Preventing the kernel from dishing out more memory than required

I think those of us who have been running computational jobs have seen memory get eaten up by some buggy or stray application. Hopefully the kernel kills it. But you must also have seen cases where the kernel fails to kill the culprit and the server goes into limbo.

Let’s say we wish to ensure that the kernel only commits memory to processes up to a limit tied to physical memory; then we have to add the following to /etc/sysctl.conf or /etc/sysctl.d/myapp.conf.

My assumption is that you have 10 GB of swap and 20 GB of RAM, and you wish the kernel to stop committing memory at 18 GB. The commit limit is calculated as (swap size + 0.4 × RAM size).

So at /etc/sysctl.conf, the configuration will be

vm.overcommit_memory = 2
vm.overcommit_ratio = 40

Note: the ratio is 40/100. For an explanation of vm.overcommit_memory = 2, do look at Tweaking Linux Kernel Overcommit Behaviour for memory.

Once committed memory hits 18 GB, the kernel will refuse further allocations. (Strictly speaking, with vm.overcommit_memory = 2 an allocation beyond the limit simply fails, so runaway processes are stopped before the OOM killer needs to act.)

Another calculation example: if your RAM size and swap size are the same, and you wish the commit limit to equal exactly the physical memory size, then

vm.overcommit_memory = 2
vm.overcommit_ratio = 0
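The arithmetic behind both examples can be sketched as a quick shell calculation (the sizes are the assumed figures from above):

```shell
# Commit limit under vm.overcommit_memory = 2:
#   CommitLimit = swap + (overcommit_ratio / 100) * RAM

# Example 1: 10 GB swap, 20 GB RAM, ratio 40
LIMIT1=$(( 10 + 40 * 20 / 100 ))
echo "Example 1 commit limit: ${LIMIT1} GB"    # 18 GB

# Example 2: swap == RAM (say 20 GB each), ratio 0 -> limit equals RAM size
LIMIT2=$(( 20 + 0 * 20 / 100 ))
echo "Example 2 commit limit: ${LIMIT2} GB"    # 20 GB

# On a live system, compare against the kernel's own figure:
#   grep -i commitlimit /proc/meminfo
```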

For more information, do read

  1. Preventing Swapping unless absolutely necessary (Linux Toolkit)
  2. Speeding up boot time by Optimising Physical Memory and Swap (Linux Toolkit)
  3. Memory Management – Preventing the kernel from dishing out more memory than required (Linux Toolkit)
  4. Tweaking Linux Kernel Overcommit Behaviour for memory (Linux Toolkit)

Infiniband versus Ethernet myths and misconceptions

This paper is a good writeup of eight myths and misconceptions about InfiniBand. The whitepaper, Eight myths about InfiniBand WP 09-10, is from Chelsio. Here is a summary, with my inputs on selected myths.

Opinion 1: Infiniband is lower latency than Ethernet

InfiniBand vendors usually advertise latency from specialized micro-benchmarks with two servers in a back-to-back configuration. In an HPC production environment, application-level latency is what matters. InfiniBand’s lack of congestion management and adaptive routing will result in interconnect hot spots, unlike iWARP over Ethernet, which achieves reliability via TCP.

Opinion 2: QDR‐IB has higher bandwidth than 10GbE

This is interesting. QDR InfiniBand uses 8b/10b encoding, so 40 Gbps InfiniBand is effectively 32 Gbps. However, due to the limitation of PCIe Gen 2, you will hit a maximum of 26 Gbps; if you are using PCIe Gen 1, you will hit a maximum of 13 Gbps. Do read another article from Margalla Communications, High-speed Remote Direct Memory Access (RDMA) Networking for HPC. Remember that the Chelsio adapter comes as a 2 x 10GbE card; you can trunk the two ports together to come nearer to InfiniBand’s 26 Gbps maximum. Wait till 40GbE comes onto the market; it will be very challenging for InfiniBand.
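The encoding overhead above is simple arithmetic; a quick sketch of where the 32 Gbps figure comes from:

```shell
# 8b/10b encoding: every 10 signalled bits carry 8 data bits,
# so the effective data rate is 80% of the signalling rate.
SIGNAL=40                          # QDR InfiniBand signalling rate, Gbps
EFFECTIVE=$(( SIGNAL * 8 / 10 ))
echo "Effective data rate: ${EFFECTIVE} Gbps"    # 32 Gbps
```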

Opinion 3: IB Switch scale better than 10GbE

Due to the fact that Infiniband Switch is a point-to-point switch it does not have congestion management and susceptibility to hot spots for large scale clusters unlike iWARP over Ethernet. I think we should take into account coming very low latency ASIC Switch See my blog entry Watch out Infiniband! Low Latency Ethernet Switch Chips are closing the gap and larger cut-through switches like ARISTA ultra low-latency cut-through 72-port switch with Fulcrum chipsets are in the pipeline.  Purdue University 1300-nodes cluster uses Chelsio iWARP 10GE Cards.