June 9, 2023, 9:01 p.m. | /u/CS-fan-101

Natural Language Processing www.reddit.com

SlimPajama cleans and deduplicates RedPajama-1T, cutting the total token count and file size by 50%. Half the size, twice the training speed!

It’s the highest quality dataset when training to 600B tokens and, when upsampled, performs equal to or better than RedPajama. Deduplicating data at this scale was no mean feat: existing tools do not scale to a trillion tokens, so we built a custom parallel data pre-processing pipeline and are sharing the code open source with …
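The post doesn’t spell out how the deduplication works. One common technique for near-duplicate detection over large text corpora is MinHash over lowercased character n-grams; the sketch below is a minimal, illustrative version of that idea, not the actual SlimPajama pipeline, and all function names and parameters are assumptions:

```python
import random

def shingles(text, n=13):
    """Lowercased character n-grams of a document."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(shingle_set, num_perm=64, seed=42):
    """One slot per random salt: the minimum hash over all shingles.
    Two sets agree in a slot with probability equal to their Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.8):
    """Keep each document unless it is a near-duplicate of one already kept."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, prev) < threshold for prev in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

The pairwise scan here is O(n²) in the number of documents; at trillion-token scale a real pipeline would bucket signatures with locality-sensitive hashing so that only likely duplicates are ever compared.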

