June 9, 2023, 9:01 p.m. | /u/CS-fan-101

Natural Language Processing www.reddit.com

SlimPajama cleans and deduplicates RedPajama-1T, cutting the total token count and file size by 50%. Half the size, twice the training speed!

It’s the highest quality dataset when training to 600B tokens and, when upsampled, performs equal to or better than RedPajama. Deduplicating data at this scale was no mean feat: existing tools do not scale to a trillion tokens, so we built a custom parallel data pre-processing pipeline and are sharing the code open source with …
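The post doesn’t spell out how the deduplication works. One common technique for near-duplicate detection over large text corpora is MinHash over lowercased character n-grams; the sketch below is a minimal, illustrative version of that idea, not the actual SlimPajama pipeline, and all function names and parameters are assumptions:

```python
import random

def shingles(text, n=13):
    """Lowercased character n-grams of a document."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(shingle_set, num_perm=64, seed=42):
    """One slot per random salt: the minimum hash over all shingles.
    Two sets agree in a slot with probability equal to their Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_perm)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(docs, threshold=0.8):
    """Keep each document unless it is a near-duplicate of one already kept."""
    kept, signatures = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, prev) < threshold for prev in signatures):
            kept.append(doc)
            signatures.append(sig)
    return kept
```

The pairwise scan here is O(n²) in the number of documents; at trillion-token scale a real pipeline would bucket signatures with locality-sensitive hashing so that only likely duplicates are ever compared.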

