March 8, 2022, 4:37 p.m. | Nir Barazida

Towards Data Science (Medium) | towardsdatascience.com

(Cover image from Unsplash)

Stragglers and latency in synchronous distributed training of deep learning models

A review of the challenges in synchronous distributed training and the best solutions for stragglers and high latency

Abstract

Synchronous distributed training is a common way of distributing the training of machine learning models with data parallelism. In synchronous training, a root aggregator node fans out requests to many leaf nodes that work in parallel over different input data slices and return their results to the root …
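To make the fan-out/aggregate pattern concrete, here is a minimal sketch, not taken from the article, that simulates it with a thread pool standing in for leaf nodes and toy NumPy "gradients". The names `leaf_step`, `synchronous_step`, and `NUM_LEAVES` are illustrative assumptions, not part of any framework API; the random sleep models the uneven compute times that produce stragglers.

```python
# Minimal sketch of synchronous data-parallel training (illustrative only).
import time
import random
from concurrent.futures import ThreadPoolExecutor

import numpy as np

NUM_LEAVES = 4  # number of parallel leaf worker nodes (assumed)


def leaf_step(data_slice: np.ndarray) -> np.ndarray:
    # Simulate uneven compute time: the slowest leaf (a "straggler")
    # determines how long the whole synchronous step takes.
    time.sleep(random.uniform(0.01, 0.2))
    return data_slice * 2.0  # stand-in for a gradient computation


def synchronous_step(batch: np.ndarray) -> float:
    slices = np.array_split(batch, NUM_LEAVES)  # root fans out data slices
    with ThreadPoolExecutor(max_workers=NUM_LEAVES) as pool:
        # Consuming map() blocks until *every* leaf returns:
        # this is the synchronous barrier the article discusses.
        grads = list(pool.map(leaf_step, slices))
    return float(np.mean(np.concatenate(grads)))  # root aggregates results


if __name__ == "__main__":
    start = time.perf_counter()
    agg = synchronous_step(np.arange(16, dtype=np.float64))
    print(f"aggregated={agg:.2f} step_time={time.perf_counter() - start:.3f}s")
```

Running this a few times shows the step time tracking the slowest leaf rather than the average one, which is exactly why stragglers dominate latency in synchronous training.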

Tags: data science, deep learning, devops, distributed learning, machine learning, mlops, model training, training
