April 21, 2024, 10:24 p.m. | 1littlecoder

1littlecoder www.youtube.com

🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer

What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library.

🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of …
