Web: http://arxiv.org/abs/2205.01863

May 5, 2022, 1:11 a.m. | Xuandong Zhao, Lei Li, Yu-Xiang Wang

cs.LG updates on arXiv.org

Large language models have been shown to memorize private information, such as
social security numbers, from their training data. Given the sheer scale of the
training corpus, it is challenging to screen and filter this private data,
either manually or automatically. In this paper, we propose Confidentially
Redacted Training (CRT), a method to train language generation models while
protecting the confidential segments. We borrow ideas from differential privacy
(which solves a related but distinct problem) and show that our method is able …
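To make the high-level idea above concrete, here is a minimal, hypothetical sketch of one way DP-style machinery could be applied selectively to confidential segments: clip and noise the gradient (DP-SGD style) only for sequences flagged as containing confidential content. This is not the authors' CRT algorithm; the flag `is_confidential`, the toy model, and the noise parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy next-token model over a tiny vocabulary (assumed setup for illustration).
vocab_size, embed_dim, seq_len = 32, 16, 8
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(embed_dim * seq_len, vocab_size),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

clip_norm, noise_std = 1.0, 0.5  # assumed privacy hyperparameters


def train_step(tokens, target, is_confidential):
    """One update; clip and noise the gradient only for confidential sequences."""
    optimizer.zero_grad()
    loss = loss_fn(model(tokens), target)
    loss.backward()
    if is_confidential:
        # DP-SGD-style treatment: bound the gradient norm, then add Gaussian noise.
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        for p in model.parameters():
            if p.grad is not None:
                p.grad += noise_std * torch.randn_like(p.grad)
    optimizer.step()
    return loss.item()


# Example: one ordinary sequence and one flagged as containing a secret.
seq = torch.randint(0, vocab_size, (1, seq_len))
tgt = torch.randint(0, vocab_size, (1,))
print(train_step(seq, tgt, is_confidential=False))
print(train_step(seq, tgt, is_confidential=True))
```

The design choice sketched here is that non-confidential sequences are trained on normally, so utility is only paid for where protection is needed; how the actual CRT method achieves this is detailed in the paper linked above.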

arxiv language modelling
