How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
April 9, 2024, 4:42 a.m. | Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah
cs.LG updates on arXiv.org arxiv.org
Abstract: The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows …
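The loss of distribution tails under recursive training can be illustrated with a toy simulation. The sketch below is not the statistical model studied in the paper (the abstract is truncated before describing it). It assumes a simple categorical "language model" over a hypothetical 50-token vocabulary with Zipf-like probabilities, refit by empirical frequencies on its own samples at each generation. The names `recursive_training`, `vocab_size`, and all parameter values are illustrative assumptions.

```python
import random
from collections import Counter


def recursive_training(probs, n_samples, n_generations, seed=0):
    """Repeatedly refit a categorical model on samples drawn from itself.

    Returns the number of tokens with non-zero probability before training
    and after each generation.
    """
    rng = random.Random(seed)
    tokens = list(range(len(probs)))
    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(n_generations):
        # Generate synthetic data from the current model.
        sample = rng.choices(tokens, weights=probs, k=n_samples)
        counts = Counter(sample)
        # "Train" the next model: maximum-likelihood (empirical) frequencies.
        probs = [counts[t] / n_samples for t in tokens]
        support_sizes.append(sum(p > 0 for p in probs))
    return support_sizes


# Assumed toy setup: Zipf-like initial distribution over 50 tokens.
vocab_size = 50
weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
total = sum(weights)
initial_probs = [w / total for w in weights]

support_sizes = recursive_training(initial_probs, n_samples=200, n_generations=30)
print(support_sizes)
```

In this toy setting, a token that receives zero count in one generation has zero probability in every later generation, so the support can only shrink. Rare (tail) tokens are the most likely to be lost first.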