Oct. 30, 2023, 4:56 p.m. | Together

Blog Content - TOGETHER www.together.xyz

Releasing a new version of the RedPajama dataset, with 30 trillion filtered
and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps
covering 5 languages, along with 40+ pre-computed data quality annotations
that can be used for further filtering and weighting.
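
As a rough illustration of what "filtering and weighting" with per-document annotations could look like, here is a minimal sketch. The annotation names (`word_count`, `dup_line_frac`), the thresholds, and the `Doc` container are all hypothetical and chosen only for this example; they are not the dataset's actual schema or recommended settings.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    signals: dict  # hypothetical annotation name -> pre-computed value


def keep(doc: Doc, min_words: int = 50, max_dup_frac: float = 0.3) -> bool:
    """Filtering: drop documents that are too short or too repetitive.

    Thresholds are illustrative assumptions, not recommended values.
    """
    s = doc.signals
    return s["word_count"] >= min_words and s["dup_line_frac"] <= max_dup_frac


def weight(doc: Doc) -> float:
    """Weighting: down-weight documents with more duplicated lines."""
    return max(0.0, 1.0 - doc.signals["dup_line_frac"])


# Toy documents with made-up annotation values.
docs = [
    Doc("a", {"word_count": 120, "dup_line_frac": 0.0}),
    Doc("b", {"word_count": 10, "dup_line_frac": 0.0}),    # too short
    Doc("c", {"word_count": 300, "dup_line_frac": 0.5}),   # too repetitive
    Doc("d", {"word_count": 80, "dup_line_frac": 0.25}),
]

kept = [d for d in docs if keep(d)]
weights = [weight(d) for d in kept]
```

Because the annotations are pre-computed, a selection rule like this can be changed and re-run without reprocessing the raw text.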
