March 29, 2024, 3:56 p.m. | /u/fasttosmile | r/MachineLearning (www.reddit.com)

edit: To be clear, I'm interested in how FSDP deals with models being too big for one GPU (the fact that Data Parallelism is used is not what I want to discuss).

I got to this understanding after reading the [FSDP paper](https://arxiv.org/abs/2304.11277):

> 1. Perform computations with parameter shards and communicate activations accordingly. [...]
> 2. Perform the same computation as local training by communicating parameters on-demand before computations. Since parameter communications do not have any data dependency on preceding …
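So, as I understand it, FSDP takes the second approach: each FSDP unit keeps only a shard of its parameters, all-gathers the full parameters for that unit on demand just before its forward/backward compute, and frees them again afterwards, so only a few units are ever fully materialized on a GPU at once. A minimal sketch of what that looks like with PyTorch's FSDP wrapper (assuming a single multi-GPU host launched with torchrun; the layer sizes and the wrap threshold are made up for illustration):

```python
import functools

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def main():
    # Run with: torchrun --nproc_per_node=<num_gpus> this_script.py
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # A stack of layers that may not fit on one GPU when fully materialized.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])

    # Each wrapped unit holds only a 1/world_size shard of its parameters.
    # The unit's full parameters are all-gathered on demand right before its
    # forward/backward compute and released again afterwards.
    model = FSDP(
        model,
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=1_000_000
        ),
        device_id=rank,
    )

    x = torch.randn(2, 4096, device="cuda")
    loss = model(x).sum()
    loss.backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If that reading is right, it also explains the quoted sentence: because the all-gather for the next unit has no data dependency on the current unit's compute, it can be prefetched and overlapped with computation.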

