[R] RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models

Oct. 30, 2023, 9:55 p.m. | /u/APaperADay

Machine Learning www.reddit.com

**Blog**: [https://together.ai/blog/redpajama-data-v2](https://together.ai/blog/redpajama-data-v2)

**Hugging Face**: [https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2)

**GitHub**: [https://github.com/togethercomputer/RedPajama-Data](https://github.com/togethercomputer/RedPajama-Data)

**Description**:

>RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated.
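
To make "quality signals" and "deduplicated" concrete, here is a toy sketch of how per-document signals and a duplicate check might be used to filter a corpus. This is not the RedPajama or CCNet pipeline: the record layout, the `quality_signals` and `filter_corpus` helpers, the `min_words` threshold, and the use of a SHA-256 hash over normalized text are all assumptions made for illustration.

```python
import hashlib

# Hypothetical documents; the real dataset's schema is not reproduced here.
docs = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "the quick  brown fox jumps over the lazy dog."},  # near-verbatim copy
    {"text": "Click here"},                                     # too short to be useful
    {"text": "Open datasets make language model training easier to reproduce."},
]


def quality_signals(text):
    """Compute a couple of illustrative per-document signals."""
    words = text.split()
    mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"word_count": len(words), "mean_word_length": mean_len}


def content_hash(text):
    """Hash of lowercased, whitespace-normalized text, used for exact dedup."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_corpus(documents, min_words=5):
    """Keep documents that pass the word-count threshold and are not duplicates."""
    seen = set()
    kept = []
    for doc in documents:
        signals = quality_signals(doc["text"])
        if signals["word_count"] < min_words:
            continue  # fails the (hypothetical) quality threshold
        digest = content_hash(doc["text"])
        if digest in seen:
            continue  # same normalized content was already kept
        seen.add(digest)
        kept.append({**doc, "signals": signals})
    return kept


kept = filter_corpus(docs)
for doc in kept:
    print(doc["signals"]["word_count"], doc["text"])
```

Keeping the signals alongside each document, rather than only the filtered result, lets different thresholds be tried later without recomputing them, which is the general idea behind shipping quality signals with a corpus.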


