Feb. 15, 2024, 5:46 a.m. | Matthieu Meeus, Igor Shilov, Manuel Faysse, Yves-Alexandre de Montjoye

cs.CL updates on arXiv.org

arXiv:2402.09363v1 Announce Type: new
Abstract: Questions of fair use of copyright-protected content to train Large Language Models (LLMs) are actively debated. Document-level inference has been proposed as a new task: inferring, from black-box access to the trained model, whether a piece of content was seen during training. However, state-of-the-art (SOTA) methods rely on naturally occurring memorization of (part of) the content. While very effective against models that memorize a lot, we hypothesize--and later confirm--that they will not work …
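The memorization-based approach the abstract refers to is commonly instantiated as a loss-based membership inference test: score a document by the model's average per-token negative log-likelihood and flag low-loss documents as likely training members. The sketch below illustrates that idea only; the `token_probs` stub and the threshold value are illustrative assumptions, not the paper's method, and a real attack would query an actual LLM's token probabilities.

```python
import math

def token_probs(tokens, memorized=frozenset({"seen", "during", "training"})):
    # Hypothetical stand-in for a black-box LLM scoring API.
    # Toy assumption: the model assigns high probability to tokens it
    # "memorized" during training and low probability otherwise.
    return [0.9 if t in memorized else 0.1 for t in tokens]

def doc_loss(tokens):
    """Average negative log-likelihood of the document under the model."""
    probs = token_probs(tokens)
    return -sum(math.log(p) for p in probs) / len(probs)

def infer_membership(tokens, threshold=1.5):
    """Loss-based inference: unusually low loss suggests the document
    was (partly) memorized, i.e. seen during training."""
    return doc_loss(tokens) < threshold

member = ["seen", "during", "training"]
non_member = ["novel", "unseen", "text"]
print(infer_membership(member))      # low loss: likely a training member
print(infer_membership(non_member))  # high loss: likely a non-member
```

As the abstract notes, this style of test hinges on the model memorizing enough of the content; against models with little natural memorization, the loss gap between members and non-members shrinks and the threshold test fails.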

