A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs
March 28, 2024, 4:43 a.m. | Siddharth Singh, Prajwal Singhania, Aditya K. Ranjan, Zack Sating, Abhinav Bhatele
cs.LG updates on arXiv.org
Abstract: Large communication costs are a critical bottleneck in training state-of-the-art neural networks on distributed systems. This paper introduces AxoNN, a novel four-dimensional (4D) parallelization approach, inspired by Agarwal's algorithm for matrix multiplication, for parallelizing tensor computations in deep learning. AxoNN employs two key strategies to minimize communication overhead. First, we optimize communication by overlapping expensive collective operations (reduce-scatter, all-gather, all-reduce) with computations. Our experiments with a 20-billion parameter transformer model demonstrate that these optimizations deliver …
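The overlap strategy the abstract describes, hiding collective communication behind independent computation, can be illustrated with a minimal sketch. This is not AxoNN's actual 4D schedule; the function, tensor shapes, and the single asynchronous all-gather below are illustrative assumptions using PyTorch's standard distributed API.

```python
# Minimal sketch (assumption, not the paper's implementation) of overlapping
# a collective with computation via torch.distributed async ops.
import torch
import torch.distributed as dist


def overlapped_layer(x: torch.Tensor, local_weight_shard: torch.Tensor) -> torch.Tensor:
    """Gather a sharded weight asynchronously while computing on the activation."""
    world_size = dist.get_world_size()

    # Launch an asynchronous all-gather of the weight shards.
    gathered = [torch.empty_like(local_weight_shard) for _ in range(world_size)]
    handle = dist.all_gather(gathered, local_weight_shard, async_op=True)

    # Independent computation proceeds while communication is in flight,
    # e.g. applying the activation function to the incoming tensor.
    x = torch.nn.functional.gelu(x)

    # Wait for the collective to complete, then use the full weight matrix.
    handle.wait()
    full_weight = torch.cat(gathered, dim=0)
    return x @ full_weight.t()
```

The point of the sketch is the ordering: the collective is issued before the independent work, so its cost is hidden to the extent the computation is long enough to cover it.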