Oct. 30, 2023, 8:29 p.m. | 1littlecoder

1littlecoder www.youtube.com

The RedPajama v2 dataset contains 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.
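As a rough illustration of what "filtering and weighting" with pre-computed annotations can look like, here is a minimal Python sketch. It does not use the real dataset or its schema: the records, the signal names (`word_count`, `dup_line_frac`), and the thresholds are all hypothetical and chosen only for this example.

```python
# Hypothetical records: each document carries a dict of pre-computed quality
# signals. The signal names and values below are invented for illustration
# and are not the actual RedPajama annotation names.
docs = [
    {"text": "a long, clean article ...", "signals": {"word_count": 850, "dup_line_frac": 0.02}},
    {"text": "buy now buy now buy now", "signals": {"word_count": 6, "dup_line_frac": 0.80}},
    {"text": "a medium forum post ...", "signals": {"word_count": 120, "dup_line_frac": 0.10}},
]


def passes_filter(signals, min_words=50, max_dup_frac=0.3):
    """Keep a document only if it is long enough and not too repetitive."""
    return (
        signals["word_count"] >= min_words
        and signals["dup_line_frac"] <= max_dup_frac
    )


def sampling_weight(signals):
    """Illustrative weight: documents with fewer duplicated lines count more."""
    return 1.0 - signals["dup_line_frac"]


# Filtering: drop documents that fail the thresholds.
kept = [d for d in docs if passes_filter(d["signals"])]

# Weighting: assign each surviving document a relative sampling weight.
weights = [sampling_weight(d["signals"]) for d in kept]

for doc, w in zip(kept, weights):
    print(f"{w:.2f}  {doc['text']}")
```

The point of shipping annotations rather than a single fixed filter is that thresholds like the ones above can be tuned per use case without reprocessing the raw crawl.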

RedPajama Data v2 Announcement - https://together.ai/blog/redpajama-data-v2

RedPajama-based Projects - https://huggingface.co/search/full-text?q=redpajama

RedPajama Data Processing Scripts - https://github.com/togethercomputer/RedPajama-Data

RedPajama Data v2 on Hugging Face - https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2

Common Crawl - https://commoncrawl.org/

❤️ If you want to support the channel ❤️
Support here:
Patreon - https://www.patreon.com/1littlecoder/
Ko-Fi - …

