Releasing Common Corpus: the largest public domain dataset for training LLMs
Simon Willison's Weblog simonwillison.net
Released today. 500 billion words from "a wide diversity of cultural heritage initiatives". 180 billion words of English, 110 billion of French, 30 billion of German, then Dutch, Spanish and Italian.
It includes quite a lot of US public domain data: 21 million digitized out-of-copyright newspapers (or do they mean newspaper articles?).
"This is only an initial part of what we have collected so far, in part due to …