March 14, 2024, 4:48 a.m. | Rik van Noord, Taja Kuzman, Peter Rupnik, Nikola Ljubešić, Miquel Esplà-Gomis, Gema Ramírez-Sánchez, Antonio Toral

cs.CL updates on arXiv.org

arXiv:2403.08693v1 Announce Type: new
Abstract: Large, curated, web-crawled corpora play a vital role in training language models (LMs). They form the lion's share of the training data in virtually all recent LMs, such as the well-known GPT, LLaMA and XLM-RoBERTa models. However, despite this importance, relatively little attention has been given to the quality of these corpora. In this paper, we compare four of the currently most relevant large, web-crawled corpora (CC100, MaCoCu, mC4 and OSCAR) across eleven lower-resourced European …

