Oct. 30, 2023, 8:29 p.m. | 1littlecoder

1littlecoder www.youtube.com

RedPajama-Data-v2 is a dataset with 30 trillion filtered and deduplicated tokens (100+ trillion raw) drawn from 84 CommonCrawl dumps covering 5 languages. It ships with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.
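To make the "filtering and weighting" idea concrete, here is a minimal sketch of how per-document quality annotations could drive both steps. Everything in it is an assumption for illustration: the `Doc` class, the signal names `word_count` and `perplexity`, and the thresholds are hypothetical and are not the dataset's actual schema or recommended settings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Doc:
    """A document plus its quality annotations (hypothetical schema)."""
    text: str
    signals: Dict[str, float]  # annotation name -> value; names are illustrative


def keep(doc: Doc, min_words: int = 50, max_perplexity: float = 500.0) -> bool:
    """Filtering: drop documents that are too short or too 'surprising'."""
    s = doc.signals
    return s["word_count"] >= min_words and s["perplexity"] <= max_perplexity


def weight(doc: Doc, max_perplexity: float = 500.0) -> float:
    """Weighting: lower perplexity gives a higher sampling weight in [0, 1]."""
    return max(0.0, 1.0 - doc.signals["perplexity"] / max_perplexity)


def filter_and_weight(docs: List[Doc]) -> List[Tuple[Doc, float]]:
    """Apply the filter, then attach a weight to each surviving document."""
    return [(d, weight(d)) for d in docs if keep(d)]


docs = [
    Doc("a long, clean article", {"word_count": 120, "perplexity": 250.0}),
    Doc("too short", {"word_count": 10, "perplexity": 100.0}),
    Doc("long but noisy", {"word_count": 200, "perplexity": 900.0}),
]
selected = filter_and_weight(docs)
```

With pre-computed annotations, thresholds like these can be changed and re-applied without re-processing the raw crawl, which is the practical benefit the description points at.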

RedPajama Data v2 Announcement - https://together.ai/blog/redpajama-data-v2

RedPajama-based Projects - https://huggingface.co/search/full-text?q=redpajama

RedPajama Data Processing Scripts - https://github.com/togethercomputer/RedPajama-Data

RedPajama Data v2 on Hugging Face - https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2

Common Crawl - https://commoncrawl.org/

❤️ If you want to support the channel ❤️
Support here:
Patreon - https://www.patreon.com/1littlecoder/
Ko-Fi - …

