Do Language Models Care About Text Quality? Evaluating Web-Crawled Corpora Across 11 Languages
March 14, 2024, 4:48 a.m. | Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral
cs.CL updates on arXiv.org arxiv.org
Abstract: Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, including the well-known GPT, LLaMA and XLM-RoBERTa models. Despite this importance, relatively little attention has been paid to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European …