March 26, 2024, 4:43 a.m. | Shuai Zhao, Linchao Zhu, Ruijie Quan, Yi Yang

cs.LG updates on arXiv.org

arXiv:2403.15740v1 Announce Type: cross
Abstract: Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs) and their fine-tuned variants. Billions of data are crawled from the web and fed to LLMs. How can everyday web users confirm if LLMs misuse their data without permission? In this work, we suggest that users repeatedly insert personal passphrases into their documents, enabling LLMs to memorize them. These concealed passphrases in user documents, referred to as ghost sentences, …
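
The insertion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the word list, the function names (`make_passphrase`, `insert_passphrase`, `mark_documents`), the six-word passphrase length, and the choice to place the passphrase on its own line at a random line boundary are all assumptions made for illustration.

```python
import random

# Hypothetical word list (an assumption). A real passphrase would presumably
# draw on a much larger list so that it is unlikely to occur by chance.
WORDS = ["copper", "lantern", "orbit", "meadow", "quartz", "violin", "harbor", "saffron"]


def make_passphrase(rng: random.Random, n_words: int = 6) -> str:
    """Pick n_words distinct words at random and join them into one passphrase."""
    return " ".join(rng.sample(WORDS, n_words))


def insert_passphrase(document: str, passphrase: str, rng: random.Random) -> str:
    """Insert the passphrase as its own line at a random line boundary."""
    lines = document.split("\n")
    position = rng.randint(0, len(lines))  # randint is inclusive on both ends
    lines.insert(position, passphrase)
    return "\n".join(lines)


def mark_documents(documents: list[str], passphrase: str, seed: int = 0) -> list[str]:
    """Insert the same passphrase once into every document (assumed policy)."""
    rng = random.Random(seed)
    return [insert_passphrase(doc, passphrase, rng) for doc in documents]


if __name__ == "__main__":
    phrase = make_passphrase(random.Random())
    for marked in mark_documents(["First post.\nSecond line.", "Another post."], phrase):
        print(marked)
        print("---")
```

Reusing one passphrase across many of a user's documents reflects the "repeatedly insert" idea in the abstract. How many repetitions are needed, and how a user would later check for memorization, lie in the truncated part of the abstract and are not sketched here.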

