April 22, 2024, 4:42 a.m. | Zhongyi Lin, Ning Sun, Pallab Bhattacharya, Xizhou Feng, Louis Feng, John D. Owens

cs.LG updates on arXiv.org

arXiv:2404.12674v1 Announce Type: cross
Abstract: Characterizing and predicting the training performance of modern machine learning (ML) workloads on compute systems, where compute and communication are spread across CPUs, GPUs, and network devices, is key to optimization and planning, yet a complex goal to achieve. The primary challenges include the complexity of synchronization and load balancing between CPUs and GPUs, the variance in input data distribution, and the use of different communication devices and topologies (e.g., NVLink, PCIe, …
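To make the prediction task concrete, here is a minimal, hypothetical sketch, not the paper's model, of the kind of baseline estimate such work starts from: an additive per-iteration time combining a compute term with a ring-allreduce communication term. All names and parameters (estimate_iteration_time_s, gpu_flops, link_bw, etc.) are illustrative assumptions, not from the paper.

```python
# Hypothetical sketch (not the paper's model): a simple additive
# per-iteration time estimate for data-parallel training.

def estimate_iteration_time_s(
    flops_per_iter: float,  # total FLOPs per training iteration
    gpu_flops: float,       # sustained per-GPU throughput (FLOP/s)
    grad_bytes: float,      # gradient payload per GPU (bytes)
    link_bw: float,         # per-link bandwidth (bytes/s), e.g. NVLink or PCIe
    num_gpus: int,
) -> float:
    """Estimate one iteration's wall time under an additive model."""
    compute = flops_per_iter / gpu_flops
    # Ring allreduce moves ~2*(n-1)/n of the payload over each link.
    comm = 2.0 * (num_gpus - 1) / num_gpus * grad_bytes / link_bw
    return compute + comm  # ignores overlap and CPU-side launch overhead

# Example: 8 GPUs at 100 TFLOP/s, 1 GB of gradients over 25 GB/s links.
print(estimate_iteration_time_s(5e14, 1e14, 1e9, 25e9, 8))
```

Real workloads violate these additive assumptions: compute and communication overlap, CPU-side synchronization and load balancing add overhead, and kernel times vary with the input data distribution, which is precisely what makes the modeling problem the abstract describes difficult.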

