The OpenFabrics Enterprise Distribution (OFED) provides a collection of simple performance micro-benchmarks written over uverbs. The notes below are taken from the OFED Performance Tests README:
- The benchmark uses the CPU cycle counter to get time stamps without a context switch. Some CPU architectures do not provide such a counter.
- The benchmark measures round-trip time but reports half of that as one-way latency. This means that it may not be sufficiently accurate for asymmetrical configurations.
- Min/Median/Max results are reported. The median (vs. average) is less sensitive to extreme scores. Typically, the max value is the first value measured.
- Larger samples only help marginally. The default (1000) is very satisfactory. Note that an array of cycles_t (typically an unsigned long) is allocated once to collect samples and again to store the difference between them. Really big sample sizes (e.g., 1 million) might expose other problems with the program.
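The reporting scheme described in these notes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual C code: the function name `summarize_latency`, the `cpu_mhz` parameter and the sample readings are all hypothetical. It keeps one list of time stamps and a second list of differences, halves each round trip, and reports min/median/max.

```python
from statistics import mean, median

def summarize_latency(timestamps, cpu_mhz):
    """Return (min, median, max) one-way latency in microseconds.

    timestamps: cycle-counter readings, one per iteration boundary.
    cpu_mhz:    counter frequency in MHz, i.e. cycles per microsecond.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two time stamps")
    # Second array: differences between consecutive stamps (round trips, in cycles).
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Convert cycles to microseconds and report half the round trip as one-way latency.
    one_way_usec = [d / cpu_mhz / 2 for d in deltas]
    return min(one_way_usec), median(one_way_usec), max(one_way_usec)

# Hypothetical readings from a 2000 MHz counter; the first round trip is slow.
stamps = [0, 40000, 44800, 49600, 54800, 59600]
t_min, t_med, t_max = summarize_latency(stamps, cpu_mhz=2000.0)

# The average is pulled up by the single slow first sample; the median is not.
one_way = [(b - a) / 2000.0 / 2 for a, b in zip(stamps, stamps[1:])]
t_avg = mean(one_way)
print(t_min, t_med, t_max, t_avg)
```

With made-up data like this, the slow first sample shows up as the max and raises the average, while the median stays close to the typical value.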
On the Server Side
# ib_write_lat -a
On the Client Side
# ib_write_lat -a Server_IP_address
------------------------------------------------------------------
                    RDMA_Write Latency Test
 Number of qps   : 1
 Connection type : RC
 Mtu             : 2048B
 Link type       : IB
 Max inline data : 400B
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
------------------------------------------------------------------
 local address: LID 0x01 QPN 0x02ce PSN 0x1bd93e RKey 0x014a00 VAddr 0x002b7004651000
 remote address: LID 0x03 QPN 0x00f2 PSN 0x20aec7 RKey 0x010100 VAddr 0x002aeedfbde000
------------------------------------------------------------------
 #bytes   #iterations  t_min[usec]  t_max[usec]  t_typical[usec]
 2        1000         0.92         5.19         1.24
 4        1000         0.92         65.20        1.24
 8        1000         0.90         72.28        1.23
 16       1000         0.92         19.56        1.25
 32       1000         0.94         17.74        1.26
 64       1000         0.94         26.40        1.20
 128      1000         1.05         53.24        1.36
 256      1000         1.70         21.07        1.83
 512      1000         2.13         11.61        2.22
 1024     1000         2.44         8.72         2.52
 2048     1000         2.79         48.23        3.09
 4096     1000         3.49         52.59        3.63
 8192     1000         4.58         64.90        4.69
 16384    1000         6.63         42.26        6.76
 32768    1000         10.80        31.11        10.91
 65536    1000         19.14        35.82        19.23
 131072   1000         35.56        62.17        35.84
 262144   1000         68.95        80.15        69.10
 524288   1000         135.34       195.46       135.62
 1048576  1000         268.37       354.36       268.64
 2097152  1000         534.34       632.83       534.67
 4194304  1000         1066.41      1150.52      1066.71
 8388608  1000         2130.80      2504.32      2131.39
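If you want to post-process a report like the one above, a small parser is enough. The sketch below is an assumption-laden example rather than part of OFED: `parse_rows` is a hypothetical helper, and the sample text is just three rows copied from the output above. It also divides message size by typical latency for the largest message, which gives only a rough sanity check, since this is a latency test and not a bandwidth test.

```python
def parse_rows(report):
    """Parse '#bytes #iterations t_min t_max t_typical' data rows into dicts."""
    rows = []
    for line in report.splitlines():
        fields = line.split()
        # Data rows have exactly five fields; skip separators and settings lines.
        if len(fields) != 5:
            continue
        try:
            size, iters = int(fields[0]), int(fields[1])
            t_min, t_max, t_typ = (float(f) for f in fields[2:])
        except ValueError:
            # Non-numeric five-field lines (e.g. the column header) are ignored.
            continue
        rows.append({"bytes": size, "iters": iters,
                     "t_min": t_min, "t_max": t_max, "t_typical": t_typ})
    return rows

# Three rows copied from the report above, including the column header.
sample = """\
 #bytes #iterations t_min[usec] t_max[usec] t_typical[usec]
 2 1000 0.92 5.19 1.24
 65536 1000 19.14 35.82 19.23
 8388608 1000 2130.80 2504.32 2131.39
"""

rows = parse_rows(sample)
largest = rows[-1]
# Bytes per microsecond is numerically equal to 10^6 bytes per second.
mb_per_s = largest["bytes"] / largest["t_typical"]
print(len(rows), round(mb_per_s, 1))
```

For small messages the latency is dominated by fixed per-operation cost, so the same division says little there; it is only meaningful for the largest sizes, where time grows roughly in proportion to message size.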
Common Options you can use.
Common Options to all tests:
  -p, --port=<port>        listen on/connect to port <port> (default: 18515)
  -m, --mtu=<mtu>          mtu size (default: 1024)
  -d, --ib-dev=<dev>       use IB device <dev> (default: first device found)
  -i, --ib-port=<port>     use port <port> of IB device (default: 1)
  -s, --size=<size>        size of message to exchange (default: 1)
  -a, --all                run sizes from 2 till 2^23
  -t, --tx-depth=<dep>     size of tx queue (default: 50)
  -n, --iters=<iters>      number of exchanges (at least 100, default: 1000)
  -C, --report-cycles      report times in cpu cycle units (default: microseconds)
  -H, --report-histogram   print out all results (default: print summary only)
  -U, --report-unsorted    (implies -H) print out unsorted results (default: sorted)
  -V, --version            display version number
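For scripted runs, the options can be assembled programmatically. The following is a minimal sketch, assuming you drive the tool from Python; `build_ib_write_lat_args` is a hypothetical helper that uses only flags from the list above and only builds the argument list, without launching anything. It follows the server/client pattern shown earlier: no host argument on the server, the server address as the last argument on the client.

```python
def build_ib_write_lat_args(server=None, size=None, iters=None,
                            all_sizes=False, port=None, ib_dev=None,
                            histogram=False):
    """Build an argv list for ib_write_lat using flags from the option list."""
    args = ["ib_write_lat"]
    if all_sizes:
        args.append("-a")                 # run sizes from 2 till 2^23
    if size is not None:
        args += ["-s", str(size)]         # size of message to exchange
    if iters is not None:
        if iters < 100:
            # The option list states "at least 100" for the number of exchanges.
            raise ValueError("iters must be at least 100")
        args += ["-n", str(iters)]
    if port is not None:
        args += ["-p", str(port)]         # listen on/connect to this port
    if ib_dev is not None:
        args += ["-d", ib_dev]            # IB device name (site-specific)
    if histogram:
        args.append("-H")                 # print out all results
    if server is not None:
        args.append(server)               # client side: server address goes last
    return args

# Same options on both sides, mirroring the "-a" pair shown earlier.
server_cmd = build_ib_write_lat_args(size=64, iters=5000)
client_cmd = build_ib_write_lat_args(server="Server_IP_address", size=64, iters=5000)
print(" ".join(server_cmd))
print(" ".join(client_cmd))
```

The resulting lists could be handed to `subprocess.run` on each host; that step is left out here because it depends on how the two machines are reached.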