Dec. 3, 2023, 5 p.m. | Yannic Kilcher

Yannic Kilcher www.youtube.com

#chatgpt #privacy #promptengineering

Researchers were able to extract large amounts of training data from ChatGPT simply by asking it to repeat a single word many times, which causes the model to diverge and start spitting out memorized text.
Why does this happen? And how much of their training data do such models really memorize verbatim?
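To call an output "memorized," the researchers check whether a long span of it appears verbatim in a training corpus. A minimal sketch of that kind of check (the helper name and the naive whitespace tokenization are my own; the paper works with model tokens and a suffix-array index over much larger data):

```python
def has_verbatim_overlap(output: str, corpus: str, min_tokens: int = 50) -> bool:
    """Return True if `output` contains a run of `min_tokens` consecutive
    tokens that appears verbatim in `corpus`.

    Naive illustration: whitespace tokenization and substring search.
    A real pipeline would use the model's tokenizer and an efficient
    index (e.g. a suffix array) over the training data.
    """
    out_toks = output.split()
    for i in range(len(out_toks) - min_tokens + 1):
        window = " ".join(out_toks[i:i + min_tokens])
        if window in corpus:
            return True
    return False
```

With a small threshold for demonstration, `has_verbatim_overlap("a b c d e f", "x a b c d e y", min_tokens=5)` returns True, since the five-token run "a b c d e" occurs verbatim in the corpus.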

OUTLINE:
0:00 - Intro
8:05 - Extractable vs Discoverable Memorization
14:00 - Models leak more data than previously thought
20:25 - Some data is …

