Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

March 5, 2024, 2:43 p.m. | Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho

cs.LG updates on arXiv.org

arXiv:2403.00867v1 Announce Type: cross
Abstract: Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, …


