April 17, 2024, 4:46 a.m. | Yanzhu Guo, Guokan Shang, Michalis Vazirgiannis, Chloé Clavel

cs.CL updates on arXiv.org

arXiv:2311.09807v2 Announce Type: replace
Abstract: This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive …
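The abstract does not specify the exact diversity metrics here, but a minimal sketch of one standard lexical diversity measure, distinct-n (the ratio of unique to total n-grams in a corpus), illustrates the kind of quantity such an analysis would track across recursive generations. The function name and example corpora below are hypothetical, not taken from the paper.

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams across a corpus.

    A common lexical diversity measure: values near 0 indicate heavy
    repetition, values near 1 indicate high lexical diversity.
    """
    total = 0
    unique = Counter()
    for text in texts:
        tokens = text.split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        unique.update(ngrams)
        total += len(ngrams)
    return len(unique) / total if total else 0.0

# Hypothetical comparison across recursive training generations:
# diversity would be expected to drop as models train on their own output.
generation_0 = ["the cat sat on the mat", "a dog ran through the park"]
generation_5 = ["the cat sat on the mat", "the cat sat on the rug"]
print(distinct_n(generation_0))  # 1.0  -- all bigrams distinct
print(distinct_n(generation_5))  # 0.6  -- repetition across outputs
```

Under this kind of measure, a decline in distinct-n over successive generations of synthetic-data training would be one concrete signature of the diversity loss the study examines.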

