Proving membership in LLM pretraining data via data watermarks
Feb. 19, 2024, 5:43 a.m. | Johnny Tian-Zheng Wei, Ryan Yixiang Wang, Robin Jia
cs.LG updates on arXiv.org | arxiv.org
Abstract: Detecting whether copyright holders' works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rightholder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts …
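The hypothesis-testing framing in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's implementation: `sample_watermark` draws a random lowercase string as a stand-in watermark, `make_toy_loss` is a hypothetical substitute for black-box loss queries against a real model, and the empirical p-value is one simple choice of test statistic. The idea is that, because the published watermark was sampled at random, a model that never saw it should score it no better than any other random watermark, which bounds the false detection rate.

```python
import random
import string


def sample_watermark(rng, length=10):
    """Draw a random character sequence; a stand-in for a data watermark."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_toy_loss(memorized):
    """Hypothetical black-box model: returns a loss function over strings.

    The memorized watermark gets a low loss; every other string gets a
    deterministic loss in [1.0, 2.0). Pass None for a model that never
    saw any watermark.
    """
    def loss_fn(text):
        if memorized is not None and text == memorized:
            return 0.1
        return 1.0 + (sum(map(ord, text)) % 100) / 100
    return loss_fn


def detection_p_value(loss_fn, published_watermark, rng, n_null=999, length=10):
    """Empirical p-value for the null 'the model never saw the watermark'.

    Compares the loss on the published watermark against losses on freshly
    sampled watermarks that were never released.
    """
    observed = loss_fn(published_watermark)
    null_losses = [loss_fn(sample_watermark(rng, length)) for _ in range(n_null)]
    # Count null watermarks the model scores at least as well as the published one.
    n_as_low = sum(1 for loss in null_losses if loss <= observed)
    # The +1 terms keep the p-value valid (never exactly zero).
    return (1 + n_as_low) / (1 + n_null)


rng = random.Random(0)
secret = sample_watermark(rng)
p_trained = detection_p_value(make_toy_loss(secret), secret, rng)
p_clean = detection_p_value(make_toy_loss(None), secret, rng)
print(f"trained on watermark: p = {p_trained:.3f}")
print(f"never saw watermark:  p = {p_clean:.3f}")
```

Rejecting the null only when the p-value falls below a chosen threshold caps the false detection rate at that threshold, under the assumption that the published watermark and the null watermarks come from the same sampling distribution.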