May 8, 2024, 9:14 p.m. | /u/Patrick-239

Machine Learning www.reddit.com

Checkpoints are super important during LLM training because they let you restart a failed job from the last known good state. At the same time, checkpointing is a big challenge for a team, mostly because of checkpoint size and the need to save checkpoints ASAP without blocking the training process. For example, a LLaMA 70B model checkpoint in training format is 782 gigabytes.

**How do you save them every hour?**

Based on our team …
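The non-blocking pattern the post alludes to can be sketched in plain Python: take a fast in-memory snapshot of the training state, then hand the slow disk write to a background thread so the training loop keeps running. This is a minimal illustration only; the function name `save_checkpoint_async` and the toy `state` dict are hypothetical, and real LLM training stacks use framework-level tooling (e.g. sharded/asynchronous checkpoint APIs) rather than `pickle`.

```python
import copy
import pickle
import tempfile
import threading

def save_checkpoint_async(state, path):
    # Hypothetical helper for illustration. Snapshot the state in memory
    # (fast), then persist it in a background thread so training is not
    # blocked by disk I/O. The deep copy means training can keep mutating
    # `state` while the write is in flight.
    snapshot = copy.deepcopy(state)  # blocking, but far cheaper than the write

    def _write():
        with open(path, "wb") as f:
            pickle.dump(snapshot, f)

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # caller can join() before the next checkpoint to avoid overlap

# Usage with a toy training state (hypothetical values)
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
path = tempfile.mktemp(suffix=".ckpt")
writer = save_checkpoint_async(state, path)
state["step"] = 101   # training continues while the write happens
writer.join()         # wait only when necessary, e.g. before the next save

with open(path, "rb") as f:
    restored = pickle.load(f)
# restored reflects the snapshot taken at save time, not the later mutation
```

The key trade-off this sketch shows: the in-memory copy still blocks briefly, so production systems often copy tensors to CPU (or a separate process) first, then stream the multi-hundred-gigabyte write out of band.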

