RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models
Oct. 30, 2023, 4:56 p.m. | Together
Blog Content - TOGETHER www.together.xyz
RedPajama-Data-v2 is an open dataset with 30 trillion filtered and deduplicated tokens (100+ trillion raw) from 84 CommonCrawl dumps covering 5 languages, along with 40+ pre-computed data quality annotations that can be used for further filtering and weighting.
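To illustrate how pre-computed quality annotations enable downstream filtering, here is a minimal sketch. The annotation field names below (`ccnet_perplexity`, `dedup_fraction`) and the thresholds are illustrative assumptions, not the exact RedPajama-Data-v2 schema; the point is that each document ships with numeric quality signals, so consumers can apply their own cutoffs without recomputing them.

```python
# Hedged sketch: threshold-based filtering over per-document quality
# annotations. Field names and thresholds are hypothetical stand-ins.

def passes_quality_filter(doc, max_perplexity=500.0, max_dup_fraction=0.3):
    """Keep a document only if its quality annotations fall within thresholds."""
    ann = doc["quality_annotations"]
    return (ann["ccnet_perplexity"] <= max_perplexity
            and ann["dedup_fraction"] <= max_dup_fraction)

docs = [
    {"text": "clean web page",
     "quality_annotations": {"ccnet_perplexity": 120.0, "dedup_fraction": 0.05}},
    {"text": "boilerplate-heavy page",
     "quality_annotations": {"ccnet_perplexity": 900.0, "dedup_fraction": 0.60}},
]

# Only the first document survives the (assumed) thresholds.
filtered = [d for d in docs if passes_quality_filter(d)]
```

The same pattern extends to soft weighting: instead of a boolean keep/drop decision, the annotation values can be mapped to per-document sampling weights during training.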