April 21, 2024, 10:24 p.m. | 1littlecoder

1littlecoder www.youtube.com

🔗 Links 🔗

🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer

What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on 🏭 datatrove, our large-scale data processing library.
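To make "deduplicated" concrete, here is a toy sketch of exact deduplication over a handful of in-memory strings. It is only an illustration under simple assumptions: the function names `normalize` and `dedup_exact` are hypothetical, it uses the Python standard library rather than datatrove, and it does not reproduce the actual FineWeb pipeline, whose details are not given in this excerpt.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedup_exact(docs):
    """Keep the first occurrence of each document, comparing normalized text.

    Hypothetical helper for illustration; not part of datatrove or FineWeb.
    """
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


pages = ["Hello   world", "hello world", "Another page"]
unique_pages = dedup_exact(pages)
```

Here the second page is dropped because it matches the first after normalization. Exact hashing of this kind only catches identical documents; near-duplicate detection at web scale typically needs fuzzier techniques.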

🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of …
