FineWeb 45TB Dataset: $500k GPU costs and Adult Content Improving LLM Quality
April 26, 2024, 4:30 p.m. | Maxim Saplin
DEV Community dev.to
This week Hugging Face released FineWeb, which appears to be the largest open dataset created specifically for LLM training, at 15 trillion tokens.
It is built from internet crawls made between summer 2013 and winter 2024. Its 15T-token size echoes the Llama 3 release of just a week earlier, which was also trained on a 15T-token dataset (versus 2T tokens for the Llama 2 series). The leap in the amount of …