Emergent and Predictable Memorization in Large Language Models. (arXiv:2304.11158v1 [cs.CL])
cs.CL updates on arXiv.org
Memorization, or the tendency of large language models (LLMs) to output
entire sequences from their training data verbatim, is a key concern for safely
deploying language models. In particular, it is vital to minimize a model's
memorization of sensitive datapoints such as those containing personally
identifiable information (PII). The prevalence of such undesirable memorization
can pose issues for model trainers, and may even require discarding an
otherwise functional model. We therefore seek to predict which sequences will
be memorized before …
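The notion of verbatim memorization described above can be made concrete with a simple check: prompt a model with the first k tokens of a training sequence and test whether its continuation reproduces the rest exactly. The sketch below illustrates the idea only; the `toy_generate` lookup-table "model" is a hypothetical stand-in, not the paper's experimental setup.

```python
def is_memorized(generate, sequence, prefix_len):
    """Return True if the model reproduces `sequence` verbatim
    when prompted with its first `prefix_len` tokens."""
    prefix = sequence[:prefix_len]
    continuation = generate(prefix, max_tokens=len(sequence) - prefix_len)
    return continuation == sequence[prefix_len:]

# Toy stand-in for a model: a lookup table of "memorized" continuations.
MEMORIZED = {("the", "quick"): ["brown", "fox"]}

def toy_generate(prefix, max_tokens):
    return MEMORIZED.get(tuple(prefix), [])[:max_tokens]

print(is_memorized(toy_generate, ["the", "quick", "brown", "fox"], 2))  # True
print(is_memorized(toy_generate, ["the", "quick", "red", "fox"], 2))    # False
```

A real evaluation would run this test over many training sequences and prefix lengths; the paper's concern is predicting which sequences pass it before a full model is trained.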