What's in the RedPajama-Data-1T LLM training set | allainews.com

April 17, 2023, 6:57 p.m. | Simon Willison's Weblog (simonwillison.net)

RedPajama is "a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens". It's a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute.

They just announced their first release: RedPajama-Data-1T, a 1.2 trillion token dataset modelled on the training data described in the original LLaMA paper.

The full dataset is 2.67TB, so I decided not to try and download the whole thing! Here's what …
