Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities
March 13, 2024, 4:43 a.m. | Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui
cs.LG updates on arXiv.org
Abstract: The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter counts. Training such large-scale models typically requires memory and computing resources that exceed those of a single GPU, necessitating distributed training. As GPU performance has rapidly evolved in recent years, computation time has shrunk, thereby increasing the proportion of communication in the overall training time. Therefore, optimizing communication for distributed training has become an urgent issue. In …
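To make the communication cost the abstract refers to concrete, here is a minimal sketch (not from the paper) of the gradient synchronization step in data-parallel training: after each backward pass, every worker averages its gradients with all others via an all-reduce before the optimizer step. It assumes the process group has already been initialized by the launcher; the helper name `allreduce_gradients` is hypothetical.

```python
import torch
import torch.distributed as dist


def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers; this collective is the kind of
    communication whose relative cost grows as per-GPU compute gets faster."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient over every rank, then divide
            # by the number of workers to obtain the global mean gradient.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

In practice, frameworks overlap these all-reduces with the backward computation and bucket small tensors together; techniques like these are among the communication optimizations such surveys cover.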