How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
April 9, 2024, 4:42 a.m. | Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah
cs.LG updates on arXiv.org (arxiv.org)
Abstract: The phenomenon of model collapse, introduced by Shumailov et al. (2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows …
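The tail-disappearance mechanism the abstract describes can be illustrated with a toy simulation. The sketch below is a minimal illustration, not the statistical model analyzed in the paper: it treats a Zipf-like categorical distribution over a vocabulary as the "real" data, then at each generation fits a new model by maximum likelihood on samples drawn from the previous generation's model. All parameters (vocab_size, n_samples, generations) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth "language": a heavy-tailed categorical distribution
# over a vocabulary (a stand-in for next-token probabilities).
vocab_size = 1000
true_probs = 1.0 / np.arange(1, vocab_size + 1)  # Zipf-like tail
true_probs /= true_probs.sum()

n_samples = 5000   # finite training set per generation (assumed size)
generations = 10

probs = true_probs
for g in range(generations):
    # "Train" generation g: estimate the distribution from samples
    # drawn from the previous generation's model (synthetic data).
    samples = rng.choice(vocab_size, size=n_samples, p=probs)
    counts = np.bincount(samples, minlength=vocab_size)
    probs = counts / counts.sum()  # maximum-likelihood estimate

    # Track how much of the tail has vanished: tokens whose
    # estimated probability has collapsed to exactly zero.
    lost = np.sum(probs == 0)
    print(f"generation {g + 1}: {lost} of {vocab_size} tokens have zero mass")
```

Because a token that receives zero counts in one generation can never be sampled again, the number of lost tokens grows monotonically across generations. Increasing n_samples slows the collapse but does not prevent it, which is the recursive forgetting of the initial (real) distribution that the abstract refers to.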