What's in the RedPajama-Data-1T LLM training set


April 17, 2023, 6:57 p.m. | Simon Willison's Weblog (simonwillison.net)

RedPajama is "a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens". It's a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.

They just announced their first release: RedPajama-Data-1T, a 1.2 trillion token dataset modelled on the training data described in the original LLaMA paper.

The full dataset is 2.67TB, so I decided not to try and download the whole thing! Here's what …


