The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
April 17, 2024, 4:46 a.m. | Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel
cs.CL updates on arXiv.org
Abstract: This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive …
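To make the notion of lexical diversity concrete, below is a minimal sketch of two generic measures (type-token ratio and distinct-n) that are commonly used for this purpose. These are illustrative only and are not necessarily the adapted metrics developed in the paper; the function names and example strings are assumptions for demonstration.

    # Hedged sketch: generic lexical diversity measures, not the paper's exact metrics.

    def type_token_ratio(tokens):
        """Fraction of unique tokens in a token list; a basic lexical diversity score."""
        return len(set(tokens)) / len(tokens) if tokens else 0.0

    def distinct_n(tokens, n=2):
        """Proportion of unique n-grams among all n-grams; higher means more diverse text."""
        if len(tokens) < n:
            return 0.0
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return len(set(ngrams)) / len(ngrams)

    # Hypothetical comparison of a human-written and a repetitive synthetic sample.
    human = "the quick brown fox jumps over the lazy dog".split()
    synthetic = "the cat sat on the mat the cat sat on the mat".split()
    print(type_token_ratio(human), type_token_ratio(synthetic))
    print(distinct_n(human, 2), distinct_n(synthetic, 2))

In a recursive-training setting, metrics like these would be computed on text generated at each generation of model-on-model training to track whether diversity declines over iterations.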