March 20, 2024, 7:34 p.m.

Simon Willison's Weblog (simonwillison.net)

Releasing Common Corpus: the largest public domain dataset for training LLMs


Released today. 500 billion words from "a wide diversity of cultural heritage initiatives". 180 billion words of English, 110 billion of French, 30 billion of German, then Dutch, Spanish and Italian.


Includes quite a lot of US public domain data - 21 million digitized out-of-copyright newspapers (or do they mean newspaper articles?)


"This is only an initial part of what we have collected so far, in part due to …
