all AI news
A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
March 28, 2024, 4:43 a.m. | Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele
cs.LG updates on arXiv.org arxiv.org
Abstract: Large communication costs are a critical bottleneck in training state-of-the-art neural networks on distributed systems. This paper introduces AxoNN, a novel four-dimensional (4D) parallelization approach, inspired by Agarwal's algorithm for matrix multiplication, for parallelizing tensor computations in deep learning, AxoNN employs two key strategies to minimize communication overhead. First, we optimize communication by overlapping expensive collective operations (reduce-scatter, all-gather, all-reduce) with computations. Our experiments with a 20-billion parameter transformer model demonstrate that these optimizations deliver …
abstract algorithm art arxiv communication costs cs.ai cs.dc cs.lg cs.pf deep learning distributed distributed systems gpus hybrid key matrix matrix multiplication networks neural networks novel paper parallelization scale state systems tensor training type
More from arxiv.org / cs.LG updates on arXiv.org
Jobs in AI, ML, Big Data
Software Engineer for AI Training Data (School Specific)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Python)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Tier 2)
@ G2i Inc | Remote
Data Engineer
@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US