Feb. 26, 2024, 5:42 a.m. | Tianlin Li, Qian Liu, Tianyu Pang, Chao Du, Qing Guo, Yang Liu, Min Lin

cs.LG updates on arXiv.org

arXiv:2402.14845v1 Announce Type: cross
Abstract: The emerging success of large language models (LLMs) heavily relies on collecting abundant training data from external (untrusted) sources. Despite substantial efforts devoted to data cleaning and curation, well-constructed LLMs have been reported to suffer from copyright infringement, data poisoning, and/or privacy violations, which would impede practical deployment of LLMs. In this study, we propose a simple and easily implementable method for purifying LLMs from the negative effects caused by uncurated data, namely, through ensembling …
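To make the ensembling idea concrete, below is a minimal decoding sketch, not the paper's actual method: at each step the next-token distributions of a large model and a benign small model are interpolated before picking a token. The model names ("gpt2-large", "gpt2"), the mixing weight alpha, greedy decoding, and the assumption that both models share one tokenizer/vocabulary are all illustrative choices, not details from the abstract.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: both models share the same vocabulary, so their
# next-token distributions can be mixed element-wise.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2-large")  # stand-in for the (uncurated) LLM
slm = AutoModelForCausalLM.from_pretrained("gpt2")        # stand-in for the benign SLM
llm.eval()
slm.eval()

@torch.no_grad()
def ensemble_generate(prompt: str, alpha: float = 0.5, max_new_tokens: int = 32) -> str:
    """Greedy decoding from the mixture p = alpha * p_llm + (1 - alpha) * p_slm."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        p_llm = torch.softmax(llm(ids).logits[:, -1, :], dim=-1)
        p_slm = torch.softmax(slm(ids).logits[:, -1, :], dim=-1)
        mixed = alpha * p_llm + (1.0 - alpha) * p_slm
        next_id = mixed.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(ensemble_generate("The quick brown fox"))

The intuition behind such a mixture is that undesirable behaviors memorized only by the large model are down-weighted wherever the small, benign model assigns them low probability; the weight alpha trades off purification against the large model's capability.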

