Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models
April 5, 2024, 4:42 a.m. | Jingyang Zhang, Jingwei Sun, Eric Yeats, Yang Ouyang, Martin Kuo, Jianyi Zhang, Hao Yang, Hai Li
cs.LG updates on arXiv.org
Abstract: The problem of pre-training data detection for large language models (LLMs) has received growing attention due to its implications for critical issues like copyright violation and test data contamination. The current state-of-the-art approach, Min-K%, measures the raw token probability, which we argue may not be the most informative signal. Instead, we propose Min-K%++, which normalizes the token probability with statistics of the categorical distribution over the whole vocabulary, accurately reflecting the relative likelihood of …
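As a rough illustration of the normalization the abstract describes, here is a minimal PyTorch sketch: each observed token's log-probability is standardized by the mean and standard deviation of log-probabilities under the model's own next-token distribution, and the lowest k% of per-token scores are averaged. The function name, the `k=0.2` default, and the exact tensor layout are illustrative assumptions, not the authors' reference implementation.

```python
import torch

def min_k_plus_plus_score(logits, input_ids, k=0.2):
    """Score a text for pre-training-data detection (hypothetical sketch).

    logits:    (seq_len, vocab_size) next-token logits from a causal LM
    input_ids: (seq_len,) token ids of the text being scored
    Higher scores suggest the text is more likely to be training data.
    """
    # Log-probabilities over the full vocabulary at each position;
    # logits at position t predict the token at position t + 1.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    probs = log_probs.exp()

    # Log-probability of each actually observed next token.
    target = input_ids[1:]
    token_log_probs = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)

    # Mean and std of the log-probability under the model's own
    # categorical distribution (the normalization statistics).
    mu = (probs * log_probs).sum(dim=-1)
    var = (probs * log_probs.pow(2)).sum(dim=-1) - mu.pow(2)
    sigma = var.clamp_min(1e-8).sqrt()

    # Standardized per-token scores; average the lowest k% of them,
    # following the Min-K% selection scheme.
    scores = (token_log_probs - mu) / sigma
    num = max(1, int(k * scores.numel()))
    return scores.topk(num, largest=False).values.mean().item()
```

In use, a higher score for a candidate passage than for held-out text of similar style would be taken as evidence that the passage appeared in pre-training data; the choice of k and the decision threshold would need to be calibrated per model.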