March 14, 2024, 4:48 a.m. | Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

cs.CL updates on arXiv.org

arXiv:2403.08693v1 Announce Type: new
Abstract: Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European …

