PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails | allainews.com

Feb. 27, 2024, 5:50 a.m. | Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash

cs.CL updates on arXiv.org arxiv.org

arXiv:2402.15911v1 Announce Type: cross
Abstract: Large language models (LLMs) are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM that is designed to check and moderate the output response of the primary LLM. Our key contribution is to show a novel attack strategy, …

abstract arxiv attacks automated cs.cl cs.cr defense generate humans jailbreak language language model language models large language large language model large language models layer llms rails them type universal work

More from arxiv.org / cs.CL updates on arXiv.org

Knowledge Graphs and Pre-trained Language Models enhanced Representation Learning for Conversational Recommender Systems 14 hours ago | arxiv.org

abstract arxiv context conversation +20

ProCoT: Stimulating Critical Thinking and Writing of Students through Engagement with Large Language Models (LLMs) 14 hours ago | arxiv.org

abstract active learning arxiv chatgpt +22

UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations 14 hours ago | arxiv.org

abstract arxiv commonsense cs.cl +10

Response: Emergent analogical reasoning in large language models 14 hours ago | arxiv.org

abstract acquired analogy arxiv +16

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization 14 hours ago | arxiv.org

abstract agents arxiv autonomous +18

NumLLM: Numeric-Sensitive Large Language Model for Chinese Finance 14 hours ago | arxiv.org

abstract arxiv chinese cs.ce +25

CookingSense: A Culinary Knowledgebase with Multidisciplinary Assertions 14 hours ago | arxiv.org

abstract acquired arxiv collection +17

GOLD: Geometry Problem Solver with Natural Language Description 14 hours ago | arxiv.org

abstract artificial artificial intelligence arxiv +22

Enhancing Surgical Robots with Embodied Intelligence for Autonomous Ultrasound Scanning 14 hours ago | arxiv.org

abstract arxiv autonomous cs.ai +17

Data Architect

@ University of Texas at Austin | Austin, TX

View on ai-jobs.net

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

View on ai-jobs.net

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

View on ai-jobs.net

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

View on ai-jobs.net

#13721 - Data Engineer - AI Model Testing

@ Qualitest | Miami, Florida, United States

View on ai-jobs.net

Elasticsearch Administrator

@ ManTech | 201BF - Customer Site, Chantilly, VA

View on ai-jobs.net