Don't Say No: Jailbreaking LLM by Suppressing Refusal
April 26, 2024, 4:47 a.m. | Yukai Zhou, Wenjie Wang
cs.CL updates on arXiv.org arxiv.org
Abstract: Ensuring the safety alignment of Large Language Models (LLMs) is crucial for generating responses consistent with human values. Despite their ability to recognize and avoid harmful queries, LLMs remain vulnerable to "jailbreaking" attacks, in which carefully crafted prompts induce them to produce toxic content. One category of jailbreak attacks reformulates the task as an adversarial attack that elicits an affirmative response from the LLM. However, the typical attack in this category, GCG, has very limited …
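The affirmative-response formulation the abstract refers to can be made concrete as a loss over a target continuation. The following is a minimal, hypothetical sketch, not the paper's exact objective: it assumes the attack combines a GCG-style term that rewards an affirmative prefix (e.g. "Sure, here is ...") with a refusal-suppression term that penalizes a stock refusal phrase, as the title suggests. The model, the phrase strings, the weight alpha, and the helper names are illustrative placeholders.

# Hedged sketch of a refusal-suppression jailbreak objective (assumed form, not
# necessarily the authors' loss). A GCG-style search over adversarial suffix
# tokens would try to minimize dsn_style_loss for a given harmful prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_nll(model, tokenizer, prompt: str, continuation: str) -> torch.Tensor:
    """Negative log-likelihood of `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, add_special_tokens=False,
                         return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    logits = model(input_ids).logits
    # Score only the continuation positions (each token predicted from the previous step).
    cont_logits = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    logprobs = torch.log_softmax(cont_logits, dim=-1)
    token_lp = logprobs.gather(-1, cont_ids.unsqueeze(-1)).squeeze(-1)
    return -token_lp.sum()

def dsn_style_loss(model, tokenizer, prompt_with_suffix: str,
                   affirmative: str = "Sure, here is how to do it:",
                   refusal: str = "I'm sorry, but I can't help with that.",
                   alpha: float = 1.0) -> torch.Tensor:
    """Encourage the affirmative prefix while suppressing the refusal phrase."""
    return (sequence_nll(model, tokenizer, prompt_with_suffix, affirmative)
            - alpha * sequence_nll(model, tokenizer, prompt_with_suffix, refusal))

In this assumed form, the first term is the standard affirmative-response objective, and the second term (weighted by alpha) pushes the suffix toward tokens that make the refusal continuation less likely, which is the "suppressing refusal" idea in the title.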