April 26, 2024, 4:30 p.m. | Maxim Saplin

DEV Community (dev.to)

This week Hugging Face has released what seems to be the largest (15 trillion tokens) open dataset specifically created for LLM training: FineWeb.


It is based on internet crawls spanning the summer of 2013 through the winter of 2024. The dataset's 15T-token size echoes the Llama 3 release from just a week earlier: Llama 3 was also trained on roughly 15T tokens, versus the 2T used for the Llama 2 series. The leap in the amount of …
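Since the dataset is published on the Hugging Face Hub, a quick way to get a feel for it is to stream a few records with the `datasets` library. The sketch below is illustrative only: the repository id `HuggingFaceFW/fineweb` and the `text` column name are assumptions, not details stated in the post.

```python
# Minimal sketch: stream a few FineWeb records without downloading the
# full multi-terabyte dump. Repo id and "text" column are assumed here.
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, record in enumerate(fineweb):
    print(record["text"][:200])  # first 200 characters of a crawled page
    if i >= 2:
        break
```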
