Nvidia at ISC 2021

Speaker: Marc Hamilton, VP of Solutions Architecture and Engineering, NVIDIA

Panelists: Nicola Rieke, Dion Harris, Timothy Costa, Gilad Shainer, Geetika Gupta

Performance Required for Deep Learning

One question I wanted to answer about deep learning is: which system, network, and protocol capabilities actually speed up training and inference? The two workloads do not necessarily need the same level of capability in each area. The lists below come from an NVIDIA presentation; after each list I have added a short code sketch to illustrate the points.

Training:

  1. Scalability requires ultra-fast networking
  2. Same hardware needs as HPC
  3. Extreme network bandwidth
  4. RDMA
  5. SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
  6. GPUDirect (https://developer.nvidia.com/gpudirect)
  7. Fast Access Storage
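
Most of these training-side requirements are exercised through a collective-communication library rather than programmed directly. Below is a minimal sketch, assuming a node with NVIDIA GPUs, PyTorch built with CUDA, and launch via `torchrun --nproc_per_node=<gpus>` (the model and training loop are toy placeholders, not from the presentation). The NCCL backend is what rides on the fast interconnect: over InfiniBand it can use RDMA, GPUDirect, and, on supported fabrics, SHARP, without any change to the script.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # NCCL provides GPU-aware collectives; RDMA/GPUDirect/SHARP sit below it.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a real network.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced via NCCL
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):  # toy training loop
        x = torch.randn(64, 1024, device=local_rank)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # triggers the inter-GPU all-reduce
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Scaling this from one node to many changes the launcher arguments, not the script, which is why the list above is mostly about the fabric underneath NCCL.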

Inferencing:

  1. Highly Transactional
  2. Ultra-low Latency
  3. Instant Network Response
  4. RDMA
  5. PeerDirect, GPUDirect
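
On the inference side the concern is per-request latency rather than aggregate bandwidth. Below is a minimal sketch of measuring that latency with CUDA events, assuming PyTorch with CUDA; the small model is an arbitrary stand-in, not from the presentation. The device-side number it prints is the budget that network response time must fit alongside, which is why RDMA and GPUDirect appear on the inference list too.

```python
import torch

# Arbitrary stand-in model; batch size 1 mimics a transactional request.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).cuda().eval()

x = torch.randn(1, 1024, device="cuda")
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.inference_mode():
    for _ in range(10):  # warm-up: allocator, caches, clocks
        model(x)
    start.record()
    model(x)  # the single request being timed
    end.record()

torch.cuda.synchronize()  # wait for the timed kernels to finish
print(f"inference latency: {start.elapsed_time(end):.3f} ms")
```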

What is the difference between a DPU, a CPU, and a GPU?

An interesting blog post explains the difference between a DPU, a CPU, and a GPU.

So What Makes a DPU Different?

A DPU is a new class of programmable processor: a system on a chip, or SoC, that combines three key elements:

  1. An industry-standard, high-performance, software-programmable, multi-core CPU, typically based on the widely used Arm architecture, tightly coupled to the other SoC components
  2. A high-performance network interface capable of parsing, processing, and efficiently transferring data at line rate, or the speed of the rest of the network, to GPUs and CPUs (see the line-rate arithmetic after this list)
  3. A rich set of flexible and programmable acceleration engines that offload and improve application performance for AI and machine learning, security, telecommunications, and storage, among others
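
To get a feel for why the network interface and the acceleration engines matter, it helps to do the line-rate arithmetic. A back-of-the-envelope sketch, assuming a 200 Gb/s port (a BlueField-2-class figure chosen for illustration, not taken from the blog) and ignoring Ethernet framing overhead:

```python
LINK_GBPS = 200    # assumed port speed, gigabits per second
PACKET_BYTES = 64  # minimum-size packet, the worst case for packet rate

bits_per_second = LINK_GBPS * 1e9
bytes_per_second = bits_per_second / 8
packets_per_second = bytes_per_second / PACKET_BYTES
ns_per_packet = 1e9 / packets_per_second

print(f"{bytes_per_second / 1e9:.0f} GB/s of data")    # 25 GB/s
print(f"{packets_per_second / 1e6:.1f} M packets/s")   # ~390.6 M packets/s
print(f"{ns_per_packet:.2f} ns per-packet budget")     # ~2.56 ns
```

At a couple of nanoseconds per small packet, a general-purpose CPU core cannot keep up on its own; that per-packet work is exactly what the DPU's dedicated engines offload.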

For more information, do take a look at What’s a DPU? …And what’s the difference between a DPU, a CPU, and a GPU?