March 29, 2024, 3:56 p.m. | /u/fasttosmile

Machine Learning www.reddit.com

edit: To be clear, I'm interested in how FSDP deals with models being too big for one GPU (the fact that data parallelism is used is not what I want to discuss).

I got to this understanding after reading the [FSDP paper](https://arxiv.org/abs/2304.11277):

> 1. Perform computations with parameter shards and communicate activations accordingly. [...]
> 2. Perform the same computation as local training by communicating parameters on-demand before computations. Since parameter communications do not have any data dependency on preceding …
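To make option 2 concrete, here is a minimal sketch of what that looks like in practice with PyTorch's FSDP wrapper (the toy model, dimensions, and training step are my own illustration, not from the paper). Each wrapped unit's parameters are sharded across ranks; the full parameters for a unit are all-gathered on demand right before that unit's computation and freed afterwards, so no single GPU ever has to hold the whole model.

```python
# Minimal sketch of FSDP's on-demand parameter gathering (assumes PyTorch >= 2.0,
# launched with torchrun so each process owns one GPU). The model and sizes are
# placeholders for illustration only.
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")          # one process per GPU
    torch.cuda.set_device(dist.get_rank())

    model = nn.Sequential(                   # toy stand-in for a model too big for one GPU
        nn.Linear(1024, 4096),
        nn.ReLU(),
        nn.Linear(4096, 1024),
    ).cuda()

    # Wrapping shards the parameters across ranks. During forward/backward,
    # each unit's full parameters exist only transiently: they are all-gathered
    # just before the unit's computation and released right after.
    fsdp_model = FSDP(model)

    optim = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
    x = torch.randn(8, 1024, device="cuda")
    loss = fsdp_model(x).sum()
    loss.backward()                          # gradients are reduce-scattered back to shards
    optim.step()                             # optimizer state is also sharded per rank


if __name__ == "__main__":
    main()
```

Because the parameter all-gather for a unit has no data dependency on earlier computation, FSDP can prefetch the next unit's parameters while the current one is still computing, which is how it overlaps communication with compute.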

