How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse
April 9, 2024, 4:42 a.m. | Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah
cs.LG updates on arXiv.org (arxiv.org)
Abstract: The phenomenon of model collapse, introduced by Shumailov et al. (2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in language models, we consider in this paper a statistical model that allows …
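The tail-disappearance mechanism the abstract describes can be illustrated with a toy simulation. The sketch below is a minimal illustration, not the statistical model analyzed in the paper: it treats a Zipf-like categorical distribution over a vocabulary as the "real" data, then at each generation fits a new model by maximum likelihood on samples drawn from the previous generation's model. All parameters (vocab_size, n_samples, generations) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth "language": a heavy-tailed categorical distribution
# over a vocabulary (a stand-in for next-token probabilities).
vocab_size = 1000
true_probs = 1.0 / np.arange(1, vocab_size + 1)  # Zipf-like tail
true_probs /= true_probs.sum()

n_samples = 5000   # finite training set per generation (assumed size)
generations = 10

probs = true_probs
for g in range(generations):
    # "Train" generation g: estimate the distribution from samples
    # drawn from the previous generation's model (synthetic data).
    samples = rng.choice(vocab_size, size=n_samples, p=probs)
    counts = np.bincount(samples, minlength=vocab_size)
    probs = counts / counts.sum()  # maximum-likelihood estimate

    # Track how much of the tail has vanished: tokens whose
    # estimated probability has collapsed to exactly zero.
    lost = np.sum(probs == 0)
    print(f"generation {g + 1}: {lost} of {vocab_size} tokens have zero mass")
```

Because a token that receives zero counts in one generation can never be sampled again, the number of lost tokens grows monotonically across generations. Increasing n_samples slows the collapse but does not prevent it, which is the recursive forgetting of the initial (real) distribution that the abstract refers to.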