Immunization against harmful fine-tuning attacks
Feb. 27, 2024, 5:50 a.m. | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
cs.CL updates on arXiv.org arxiv.org
Abstract: Approaches to aligning large language models (LLMs) with human values have focused on correcting misalignment that emerges from pretraining. However, this focus overlooks another source of misalignment: bad actors might purposely fine-tune LLMs to achieve harmful goals. In this paper, we present an emerging threat model that has arisen from alignment circumvention and fine-tuning attacks. However, lacking in previous works is a clear presentation of the conditions for effective defence. We propose a set of …