Deep Learning Training Performance with NVIDIA A100 and V100 on Dell EMC PowerEdge R7525 Servers

Article taken from: Deep Learning Training Performance on Dell EMC PowerEdge R7525 Servers with NVIDIA A100 GPUs

CUDA Basic Linear Algebra Subroutines (cuBLAS)

  • For FP16, HGEMM on the NVIDIA A100 GPU achieves 2.27 times the TFLOPS of the NVIDIA V100S GPU.
  • For FP32, SGEMM on the NVIDIA A100 GPU achieves 1.3 times the TFLOPS of the NVIDIA V100S GPU.
  • For TF32, a performance improvement is expected without code changes for deep learning applications on the new NVIDIA A100 GPUs, because math operations run on the NVIDIA A100 GPU's Tensor Cores in the new TF32 precision format. Although TF32 reduces precision by a small margin, it preserves the range of FP32 and strikes an excellent balance between speed and accuracy. Matrix multiplication gained a sizable boost, from 13.4 TFLOPS (FP32 on the NVIDIA V100S GPU) to 86.5 TFLOPS (TF32 on the NVIDIA A100 GPU).

 

MLPerf Training v0.7 ResNet-50

The runs with two NVIDIA A100 GPUs and with two NVIDIA V100S GPUs both converged at the 40th epoch. The NVIDIA A100 run took 166 minutes to converge, which is 1.8 times faster than the NVIDIA V100S run. Regarding throughput, the two NVIDIA A100 GPUs processed 5240 images per second, which is also 1.8 times faster than the two NVIDIA V100S GPUs.

HPC Application Performance with NVIDIA A100 versus V100S on Dell PowerEdge R7525 Servers

Article taken from: HPC Application Performance on Dell PowerEdge R7525 Servers with NVIDIA A100 GPGPUs

Differences between the NVIDIA A100 GPGPU and the NVIDIA V100S GPGPU

                        NVIDIA A100       NVIDIA A100       NVIDIA V100       NVIDIA V100S
                        (SXM4)            (PCIe Gen4)       (SXM2)            (PCIe Gen3)
GPU architecture        Ampere            Ampere            Volta             Volta
Memory size             40 GB             40 GB             32 GB             32 GB
CUDA cores              6912              6912              5120              5120
Base clock              1095 MHz          765 MHz           1290 MHz          1245 MHz
Boost clock             1410 MHz          1410 MHz          1530 MHz          1597 MHz
Memory clock            1215 MHz          1215 MHz          877 MHz           1107 MHz
MIG support             Yes               Yes               No                No
Peak memory bandwidth   Up to 1555 GB/s   Up to 1555 GB/s   Up to 900 GB/s    Up to 1134 GB/s
Total board power       400 W             250 W             300 W             250 W

Benchmark Results (In Summary)

HPL performance comparison for the PowerEdge R7525 server with either NVIDIA A100 or NVIDIA V100S GPGPUs

HPCG performs at a rate 70 percent higher with the NVIDIA A100 GPGPU due to higher memory bandwidth


Onboarding an NVIDIA GPGPU on CentOS KVM

  1. For vGPU testing, you will need a license, which can be requested here:
    https://www.nvidia.com/object/nvidia-enterprise-account.html
  2. Other documentation for installing vGPU on Red Hat / CentOS is here:
    https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#red-hat-el-kvm-install-configure-vgpu
  3. Virtual GPU Software Quick Start Guide
    https://linuxcluster.wordpress.com/2019/01/28/virtual-gpu-software-quick-start-guide/

In summary, the steps are:
– Install software in the host/hypervisor to help virtualize the GPUs
– Install the GPU drivers inside the guest OS of the VMs
– Install a license server (FlexNet-based) for the licensing
– Configure the license server, and configure settings within each VM to connect to it

 

NVIDIA Tesla versus NVIDIA GTX Cards

References

  1. Performance Comparison between NVIDIA’s GeForce GTX 1080 and Tesla P100 for Deep Learning
  2. Comparison of NVIDIA Tesla/Quadro and NVIDIA GeForce GPUs

 

NVIDIA EULA

The key clause is 2.1.3, which states that the driver is not licensed for data center deployment, commercial hosting, or broadcast services:
http://www.nvidia.com/content/DriverDownload-March2009/licence.php?lang=us&type=GeForce

 

FP64: 64-bit (Double-Precision) Floating-Point Calculation


Image taken from Comparison of NVIDIA Tesla/Quadro and NVIDIA GeForce GPUs

FP16: 16-bit (Half-Precision) Floating-Point Calculation


Image taken from Comparison of NVIDIA Tesla/Quadro and NVIDIA GeForce GPUs

Developing a Linux Kernel Module using GPUDirect RDMA

Taken from Developing a Linux Kernel Module using GPUDirect RDMA

1.0 Overview

GPUDirect RDMA is a technology introduced in Kepler-class GPUs and CUDA 5.0 that enables a direct path for data exchange between the GPU and a third-party peer device using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, and storage adapters.

GPUDirect RDMA is available on both Tesla and Quadro GPUs.

A number of limitations can apply, the most important being that the two devices must share the same upstream PCI Express root complex. Some of the limitations depend on the platform used and could be lifted in current/future products.

A few straightforward changes must be made to device drivers to enable this functionality with a wide range of hardware devices. This document introduces the technology and describes the steps necessary to enable a GPUDirect RDMA connection to NVIDIA GPUs on Linux.

 

1.1. How GPUDirect RDMA Works

When setting up GPUDirect RDMA communication between two peers, all physical addresses are the same from the PCI Express devices’ point of view. Within this physical address space are linear windows called PCI BARs. Each device has six BAR registers at most, so it can have up to six active 32-bit BAR regions. 64-bit BARs consume two BAR registers. The PCI Express device issues reads and writes to a peer device’s BAR addresses in the same way that they are issued to system memory.

Traditionally, resources like BAR windows are mapped to user or kernel address space using the CPU’s MMU as memory mapped I/O (MMIO) addresses. However, because current operating systems don’t have sufficient mechanisms for exchanging MMIO regions between drivers, the NVIDIA kernel driver exports functions to perform the necessary address translations and mappings.

To add GPUDirect RDMA support to a device driver, a small amount of address mapping code within the kernel driver must be modified. This code typically resides near existing calls to get_user_pages().
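In place of get_user_pages(), a GPUDirect-aware driver pins GPU memory through the NVIDIA driver's nv-p2p.h interface. The following kernel-side sketch is illustrative only: the helper name pin_gpu_buffer is invented, error handling and the surrounding driver plumbing are omitted, and the code only builds inside a kernel module compiled against the NVIDIA driver.

```c
/* Kernel-side sketch (not runnable stand-alone): pinning a region of GPU
 * virtual memory for peer-to-peer DMA via the GPUDirect RDMA API. */
#include <nv-p2p.h>

/* Invoked by the NVIDIA driver if the GPU mapping is revoked
 * (e.g., the owning process exits); the peer driver must stop DMA. */
static void free_callback(void *data)
{
    /* Tear down any DMA state that references the page table. */
}

static int pin_gpu_buffer(uint64_t gpu_va, uint64_t len,
                          struct nvidia_p2p_page_table **page_table)
{
    /* gpu_va and len must be aligned to the 64 KB GPU page boundary. */
    return nvidia_p2p_get_pages(0, 0, gpu_va, len,
                                page_table, free_callback, NULL);
    /* On success, the page table lists the physical addresses of the
     * GPU pages; program these into the peer device's DMA engine and
     * release them later with nvidia_p2p_put_pages(). */
}
```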

The APIs and control flow involved with GPUDirect RDMA are very similar to those used with standard DMA transfers.

References:

Read more at: http://docs.nvidia.com/cuda/gpudirect-rdma/index.html