This blog entry summarises the excellent article “Achieving Breakthrough MPI Performance with Fabric Collectives Offload” by Voltaire. According to the paper:
What are MPI collectives?
- MPI is the de facto standard for communication among the processes that make up a parallel program running on a distributed-memory system.
- MPI functions include point-to-point communication and group (collective) communication between many nodes.
- For some collectives, the operation involves a mathematical group operation performed on the results from each process, such as summation or determining the min/max value.
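To make the idea concrete, here is a minimal sketch in plain Python (not real MPI) of what an allreduce-style collective computes: each rank contributes a local value, and after the operation every rank holds the same combined result. The rank numbers and values are hypothetical.

```python
# Illustrative sketch (plain Python, not real MPI): each "rank" holds a local
# value; after an allreduce-style collective, every rank holds the same
# global result (here, a sum and a min over all contributions).
local_values = {0: 4, 1: 7, 2: 1, 3: 9}  # hypothetical per-rank data

def allreduce(values, op):
    """Combine every rank's contribution and give the result to all ranks."""
    result = op(values.values())
    return {rank: result for rank in values}

summed = allreduce(local_values, sum)
minimum = allreduce(local_values, min)
print(summed)    # every rank holds the global sum
print(minimum)   # every rank holds the global minimum
```

In real MPI this corresponds to calls such as `MPI_Allreduce` with `MPI_SUM` or `MPI_MIN`, where the library performs the combination across nodes.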
What prevents x86 cluster application performance from scaling?
- The cluster’s network and collective operations. A collective operation is group communication that has to wait for all members of the group to participate before it can conclude. In other words, the slowest member impacts the overall performance.
- Applications can spend 50% to 60% of their time on collectives. The more nodes there are, the greater this inefficiency becomes.
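The "slowest member" point can be sketched numerically: because a collective must wait for every participant, its completion time is the maximum of the per-rank arrival times, not the average. The millisecond figures below are hypothetical.

```python
# Sketch of why the slowest participant gates a collective: the operation
# cannot complete until every rank arrives, so completion time is the
# maximum of the per-rank arrival times, not the average.
arrival_times_ms = [10.0, 10.2, 9.8, 10.1, 25.0]  # one straggler (hypothetical)

average = sum(arrival_times_ms) / len(arrival_times_ms)
completion = max(arrival_times_ms)  # collective semantics: wait for all

print(f"average arrival: {average:.2f} ms, collective completes at {completion:.1f} ms")
```

A single straggler more than doubles the completion time here even though the average barely moves, which is why collectives amplify any per-node delay.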
Problems with collective scalability –
A. Cluster Hotspot and Congestion
- A non-blocking configuration does not eliminate the problem, even though it provides higher I/O throughput. This is because application communication patterns are rarely evenly distributed, and “hot spots” do occur.
- Collective messages are affected by congestion due to the “many-to-one” nature of group communication and the large number of collective messages travelling over the fabric.
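A rough way to see the "many-to-one" pressure is to count communication steps: with a flat gather the root must absorb a message from every other rank, while a binomial-tree gather combines ranks pairwise in rounds. The counts below are structural only (real fabrics add congestion on shared links as well), and the function names are mine, not the paper's.

```python
import math

# Structural sketch of many-to-one pressure in a collective of n ranks.
def flat_gather_steps(n):
    # Every non-root rank sends directly to the root, which must
    # absorb n - 1 incoming messages.
    return n - 1

def tree_gather_rounds(n):
    # Binomial tree: ranks combine pairwise, so only ceil(log2(n)) rounds.
    return math.ceil(math.log2(n))

for n in (8, 64, 1024):
    print(f"{n:5d} ranks: flat = {flat_gather_steps(n)} msgs at root, "
          f"tree = {tree_gather_rounds(n)} rounds")
```

Even with tree-based algorithms, the per-round messages still traverse the shared fabric, which is where the hot-spot and congestion effects described above come in.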
B. Server OS noise
- In a non-real-time OS environment, many tasks and events can cause a running process to context-switch in favour of other tasks, returning to the collective operation only after some time. This “OS noise” includes hardware interrupts, page faults, swap-ins and preemption of the main program.
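OS noise interacts badly with the wait-for-all semantics above: even if each rank is only occasionally delayed, the chance that *some* rank in the group was delayed grows with the number of ranks. The sketch below simulates this with hypothetical timings and a made-up per-rank noise probability.

```python
import random

# Sketch of how OS noise amplifies with scale: each rank computes for a
# fixed time, but with some probability a context switch (noise) delays it.
# Because a collective waits for the slowest rank, larger groups are almost
# always hit by at least one delay. All numbers are hypothetical.
random.seed(42)  # fixed seed for reproducibility

def collective_time(n_ranks, compute_ms=10.0, noise_ms=5.0, p_noise=0.05):
    times = [compute_ms + (noise_ms if random.random() < p_noise else 0.0)
             for _ in range(n_ranks)]
    return max(times)  # the collective completes when the last rank arrives

for n in (4, 64, 1024):
    print(f"{n:5d} ranks: collective took {collective_time(n):.1f} ms")
```

With 1024 ranks the probability that no rank is delayed is 0.95^1024, which is effectively zero, so the collective almost always pays the full noise penalty.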
For more information on how this MPI performance problem can be resolved, look out for the upcoming blog entry “Fabric-Based Collective Offload Solution”.