Web: http://arxiv.org/abs/2007.03298

Jan. 14, 2022, 2:10 a.m. | Weiyan Wang, Cengguang Zhang, Liu Yang, Kai Chen, Kun Tan

cs.LG updates on arXiv.org

Bulk synchronous parallel (BSP) is the de facto paradigm for distributed DNN
training in today's production clusters. However, due to its global
synchronization, BSP's performance can be significantly degraded by network
bottlenecks arising from either static topology heterogeneity or dynamic
bandwidth contention. Existing solutions do not fully solve the problem:
system-level optimizations that strengthen BSP (e.g., Ring or Hierarchical
All-reduce) may still suffer from communication inefficiency, while
algorithmic optimizations that replace BSP (e.g., ASP or SSP, which relax the
global barriers) risk convergence inaccuracy.
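For context, here is a minimal sketch of the BSP pattern the abstract describes. It is not the paper's code; it assumes a torch.distributed process group has already been initialized, and the function name bsp_step is illustrative. Each worker computes a local gradient, and a blocking all-reduce forms the global barrier, so every step waits on the slowest network path.

import torch
import torch.distributed as dist

def bsp_step(model, loss_fn, batch, lr=0.01):
    # Local computation on this worker's shard of the batch.
    inputs, targets = batch
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # The global barrier of BSP: a synchronous all-reduce over every
    # gradient. No worker can start the next step until the slowest link
    # finishes, which is where topology heterogeneity and bandwidth
    # contention hurt.
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size  # average gradients across workers
    # Plain SGD update with the synchronized gradient.
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad

The ASP and SSP alternatives mentioned in the abstract would replace that blocking all-reduce with asynchronous or bounded-staleness updates, trading the barrier's consistency guarantees for throughput.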


In this paper, we …

Tags: arxiv, distributed, network, training
