April 17, 2024, 4:46 a.m. | Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel

cs.CL updates on arXiv.org

arXiv:2311.09807v2 Announce Type: replace
Abstract: This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive …
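The excerpt does not spell out the paper's metrics, but a minimal sketch of one standard lexical-diversity statistic, distinct-n (the ratio of unique n-grams to total n-grams), illustrates the kind of quantity one might track across recursive generations of synthetic text. The corpora and the drop shown here are hypothetical.

```python
from collections import Counter

def distinct_n(texts: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams over all n-grams in a corpus."""
    ngrams = Counter()
    for text in texts:
        tokens = text.split()  # whitespace tokenization, for illustration only
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i : i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Hypothetical comparison: generation 0 (human-written data) vs. a later
# recursive generation; a drop in distinct-n signals lost lexical diversity.
gen0 = ["the cat sat on the mat", "a dog ran across the yard"]
gen5 = ["the cat sat on the mat", "the cat sat on the rug"]
print(f"distinct-2, generation 0: {distinct_n(gen0):.3f}")  # 1.000
print(f"distinct-2, generation 5: {distinct_n(gen5):.3f}")  # 0.600
```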
