all AI news
TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning
April 1, 2024, 4:43 a.m. | William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Ajaya Durg, Samvit Kaul, Swati Gupta, Tushar Krishna
cs.LG updates on arXiv.org arxiv.org
Abstract: The surge of artificial intelligence, specifically large language models, has led to a rapid advent towards the development of large-scale machine learning training clusters. Collective communications within these clusters tend to be heavily bandwidth-bound, necessitating techniques to optimally utilize the available network bandwidth. This puts the routing algorithm for the collective at the forefront of determining the performance. Unfortunately, communication libraries used in distributed machine learning today are limited by a fixed set of routing …
abstract algorithm artificial artificial intelligence arxiv bandwidth collective communications cs.dc cs.lg development distributed intelligence language language models large language large language models machine machine learning network scale topology training type
More from arxiv.org / cs.LG updates on arXiv.org
Jobs in AI, ML, Big Data
Data Architect
@ University of Texas at Austin | Austin, TX
Data ETL Engineer
@ University of Texas at Austin | Austin, TX
Lead GNSS Data Scientist
@ Lurra Systems | Melbourne
Senior Machine Learning Engineer (MLOps)
@ Promaton | Remote, Europe
Data Scientist
@ Publicis Groupe | New York City, United States
Bigdata Cloud Developer - Spark - Assistant Manager
@ State Street | Hyderabad, India