Immunization against harmful fine-tuning attacks
Feb. 27, 2024, 5:50 a.m. | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
cs.CL updates on arXiv.org
Abstract: Approaches to aligning large language models (LLMs) with human values have focused on correcting misalignment that emerges from pretraining. However, this focus overlooks another source of misalignment: bad actors might purposely fine-tune LLMs to achieve harmful goals. In this paper, we present an emerging threat model that has arisen from alignment circumvention and fine-tuning attacks. Yet previous works lack a clear presentation of the conditions for an effective defence. We propose a set of …