Sept. 19, 2022, 1:50 p.m. | Luhui Hu

Towards Data Science (Medium) | towardsdatascience.com

How to scale out training large models like GPT-3 & DALL-E 2 in PyTorch

Photo by Mark Harpur on Unsplash

Recent years have witnessed exponential growth in both the scale of distributed parallel training and the size of deep learning models. In particular, Transformer-based language models have been stealing the show. The famous GPT-3 arrived with 175 billion parameters and 96 attention layers, trained with a batch size of 3.2M tokens on roughly 499 billion tokens of data. Exactly half a year later, Google published …

data distributed distributed-training model-parallelism pytorch training
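The article's title points at scale-out training in PyTorch; as a rough sketch of the simplest flavor, data parallelism via DistributedDataParallel (DDP), the snippet below shows the typical structure. The toy model, synthetic dataset, and hyperparameters are placeholders for illustration, not code from the article.

```python
# Minimal DDP sketch: each process owns one model replica and one data shard;
# gradients are all-reduced across processes during backward().
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Toy model and data standing in for a real Transformer and corpus.
    model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)

    dataset = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 512))
    sampler = DistributedSampler(dataset)  # shards the dataset across ranks
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shard assignment each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # gradient all-reduce happens here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=4 train_ddp.py`, each process trains on its own shard while DDP keeps the replicas in sync; fully sharded approaches such as FSDP extend this idea when the model itself no longer fits on a single device.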
