Using multiple GPUs for Machine Learning

Taken from Sharcnet HPC

The video considers two cases: when the GPUs are inside a single node, and the multi-node case.

Performance Required for Deep Learning

One question I wanted to answer about deep learning: which system, network, and protocol features actually speed up training and/or inferencing? The two workloads do not necessarily need the same level of hardware. The lists below are from an NVIDIA presentation.

Training:

  1. Scalability requires ultra-fast networking
  2. Same hardware needs as HPC
  3. Extreme network bandwidth
  4. RDMA
  5. SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
  6. GPUDirect (https://developer.nvidia.com/gpudirect)
  7. Fast Access Storage
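The networking items above exist to make the gradient all-reduce step of multi-GPU data-parallel training fast. As a minimal sketch of what that collective computes (real frameworks offload it to NCCL over NVLink/RDMA; the function name here is a hypothetical stand-in):

```python
# Hypothetical sketch of the gradient-averaging ("all-reduce") step that
# data-parallel training performs after each backward pass. Plain Python
# stands in for the NCCL/RDMA collective so the idea is visible.

def all_reduce_mean(gradients_per_gpu):
    """Average the gradient vectors held by each worker/GPU.

    gradients_per_gpu: a list with one equal-length gradient list per
    worker. Returns the averaged gradient every worker would hold after
    the collective completes.
    """
    n_workers = len(gradients_per_gpu)
    length = len(gradients_per_gpu[0])
    return [
        sum(g[i] for g in gradients_per_gpu) / n_workers
        for i in range(length)
    ]

# Two "GPUs" computed different gradients on their data shards; after
# the all-reduce, both apply the same averaged update.
grads = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce_mean(grads))  # [2.0, 3.0]
```

Because every worker must exchange its full gradient every step, the collective's speed is bounded by the interconnect, which is why ultra-fast networking, RDMA, and SHARP matter for training scalability.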

Inferencing:

  1. Highly Transactional
  2. Ultra-low Latency
  3. Instant Network Response
  4. RDMA
  5. PeerDirect, GPUDirect
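Inferencing requirements center on per-request latency rather than bandwidth. As a small, hedged sketch (the handler here is a dummy; a real model's inference call would go in its place), this is how tail latency is typically measured:

```python
# Hypothetical sketch: measuring request-latency percentiles, the metric
# the ultra-low-latency requirements above are about. A dummy handler
# stands in for a real model's inference call.
import time


def measure_latencies(handler, n_requests=100):
    """Time n_requests calls to handler; return latencies in milliseconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        handler()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values, pct):
    """Simple nearest-rank percentile (pct in 0..100)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]


lat = measure_latencies(lambda: sum(range(1000)))
print(f"p50={percentile(lat, 50):.3f} ms  p99={percentile(lat, 99):.3f} ms")
```

The p99 figure, not the average, is what RDMA and GPUDirect/PeerDirect help keep low: they remove CPU copies and kernel round-trips from the request path.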

What is the difference between a DPU, a CPU, and a GPU?

An interesting blog post explains the difference between a DPU, a CPU, and a GPU.


So What Makes a DPU Different?

A DPU is a new class of programmable processor: a system on a chip (SoC) that combines three key elements:

  1. An industry-standard, high-performance, software-programmable, multi-core CPU, typically based on the widely used Arm architecture, tightly coupled to the other SoC components
  2. A high-performance network interface capable of parsing, processing, and efficiently transferring data at line rate (the speed of the rest of the network) to GPUs and CPUs
  3. A rich set of flexible and programmable acceleration engines that offload and improve application performance for AI and machine learning, security, telecommunications, and storage, among others

For more information, do take a look at What’s a DPU? …And what’s the difference between a DPU, a CPU, and a GPU?

NVIDIA to Acquire Arm for $40 Billion, Creating World’s Premier Computing Company for the Age of AI

NVIDIA and SoftBank Group Corp. (SBG) today announced a definitive agreement under which NVIDIA will acquire Arm Limited from SBG and the SoftBank Vision Fund (together, “SoftBank”) in a transaction valued at $40 billion. The transaction is expected to be immediately accretive to NVIDIA’s non-GAAP gross margin and non-GAAP earnings per share.

The combination brings together NVIDIA’s leading AI computing platform with Arm’s vast ecosystem to create the premier computing company for the age of artificial intelligence, accelerating innovation while expanding into large, high-growth markets. SoftBank will remain committed to Arm’s long-term success through its ownership stake in NVIDIA, expected to be under 10 percent.

For more information, see NVIDIA to Acquire Arm for $40 Billion, Creating World’s Premier Computing Company for the Age of AI