I have a question about deep learning that I wanted to find an answer to: which system, network, and protocol features are essential to speed up training and/or inference? It may not be necessary to apply the same level of requirements to training as to inference, and vice versa. I received the following information during an NVIDIA presentation:
Training:
- Scalability requires ultra-fast networking
- Same hardware needs as HPC
- Extreme network bandwidth
- RDMA
- SHARP (Mellanox Scalable Hierarchical Aggregation and Reduction Protocol)
- GPUDirect (https://developer.nvidia.com/gpudirect)
- Fast Access Storage
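To make the SHARP point concrete: its key idea is performing the gradient reduction *inside* the switch hierarchy instead of shuttling full gradients between every pair of workers. Here is a minimal pure-Python sketch of that hierarchical sum-then-broadcast pattern (my own illustration of the concept, not NVIDIA's code; real systems do this in network hardware via NCCL/MPI collectives):

```python
def tree_allreduce(worker_grads):
    """Hierarchically sum gradient vectors (the reduction SHARP
    offloads into the switch fabric), then broadcast the result.
    worker_grads: list of equal-length lists, one per worker."""
    grads = [list(g) for g in worker_grads]
    # Pairwise reduction up the tree: the number of participants
    # halves each round, as it would at each switch level.
    while len(grads) > 1:
        merged = []
        for i in range(0, len(grads) - 1, 2):
            merged.append([a + b for a, b in zip(grads[i], grads[i + 1])])
        if len(grads) % 2:  # odd worker out passes through unchanged
            merged.append(grads[-1])
        grads = merged
    # Broadcast: every worker receives the fully reduced gradient.
    return [list(grads[0]) for _ in worker_grads]

# Example: 4 workers each contribute a 3-element gradient.
out = tree_allreduce([[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]])
# Each worker ends up with the elementwise sum [4, 8, 12].
```

The benefit is that each gradient crosses each network link roughly once per tree level rather than once per peer, which is why this class of collective scales to large clusters.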
Inference:
- Highly Transactional
- Ultra-low Latency
- Instant Network Response
- RDMA
- PeerDirect, GPUDirect
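To illustrate the "highly transactional / ultra-low latency" point: inference serving is usually judged on tail latency (e.g. p99), not just average throughput, which is why RDMA-class networking matters even at modest bandwidth. A small sketch of measuring tail latency for a request loop (the handler below is a hypothetical stand-in for a model forward pass):

```python
import time
import statistics

def measure_latency_ms(handler, n_requests=1000):
    """Time each call to `handler` and report (mean, p99) latency in
    milliseconds -- the tail is what transactional serving lives by."""
    samples = []
    for _ in range(n_requests):
        start = time.perf_counter()
        handler()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p99 = samples[int(0.99 * (len(samples) - 1))]
    return statistics.mean(samples), p99

# Hypothetical stand-in for an inference call.
mean_ms, p99_ms = measure_latency_ms(lambda: sum(range(1000)))
```

A single slow hop in the network shows up directly in that p99 figure, which is why the inference column emphasizes instant network response over raw bandwidth.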