Feb. 6, 2024, 5:15 a.m. |

Mozilla Foundation Blog foundation.mozilla.org





Mozilla investigates Common Crawl’s influence as a backbone for Large Language Models: its shortcomings, benefits, and implications for trustworthy AI











(BERLIN, GERMANY | FEBRUARY 6, 2024) — When OpenAI rolled out its text generator ChatGPT in 2022, few paid attention to the outsized importance of its chief training dataset, Common Craw.

Now, Mozilla’s new study “Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI” shows how Common Crawl laid the infrastructural foundation that shaped today’s …

attention benefits berlin chatgpt data data infrastructure generative generator germany importance influence infrastructure language language models large language large language models mozilla openai report text trustworthy trustworthy ai

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Software Engineer, Data Tools - Full Stack

@ DoorDash | Pune, India

Senior Data Analyst

@ Artsy | New York City