June 11, 2024, 4:43 a.m. | Yuanpu Cao, Bochuan Cao, Jinghui Chen

cs.CL updates on arXiv.org

arXiv:2312.00027v2 Announce Type: replace-cross
Abstract: Recent developments in Large Language Models (LLMs) have manifested significant advancements. To facilitate safeguards against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While effective, such fine-tuning-based unalignment approaches also have their own limitations: (1) non-stealthiness: after fine-tuning, safety audits or …

