Gradient-Based Language Model Red Teaming. (arXiv:2401.16656v1 [cs.CL])
cs.CL updates on arXiv.org
Red teaming is a common strategy for identifying weaknesses in generative
language models (LMs), in which adversarial prompts are crafted to trigger an
LM into generating unsafe responses. Red teaming is instrumental for both model
alignment and evaluation, but is labor-intensive and difficult to scale when
done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a
red teaming method for automatically generating diverse prompts that are likely
to cause an LM to output unsafe responses. GBRT is a …
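The core idea behind gradient-based approaches like GBRT can be sketched in miniature: relax the prompt to a continuous vector, score it with a frozen "unsafeness" scorer, and follow the gradient of that score to update the prompt. The toy below is an illustration under stated assumptions, not the paper's actual method: a single linear-plus-sigmoid scorer stands in for the full LM-plus-safety-classifier pipeline, and the gradients are computed by hand.

```python
import math

# Toy sketch of gradient-based red teaming: ascend the gradient of a
# frozen "unsafeness" scorer with respect to a soft (continuous) prompt.
# The linear scorer and all constants are illustrative stand-ins, not
# the models used in the GBRT paper.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unsafe_score(prompt, weights):
    """Frozen stand-in for (LM + safety classifier): higher = less safe."""
    return sigmoid(sum(p * w for p, w in zip(prompt, weights)))

def gbrt_step(prompt, weights, lr=0.5):
    """One gradient-ascent step on the soft prompt."""
    s = unsafe_score(prompt, weights)
    # d score / d prompt_i = s * (1 - s) * w_i  (sigmoid chain rule)
    grad = [s * (1.0 - s) * w for w in weights]
    return [p + lr * g for p, g in zip(prompt, grad)]

weights = [0.8, -0.3, 0.5]   # frozen scorer parameters (illustrative)
prompt = [0.0, 0.0, 0.0]     # soft prompt initialised at zero
for _ in range(100):
    prompt = gbrt_step(prompt, weights)
print(round(unsafe_score(prompt, weights), 3))
```

In the real setting the prompt is discrete tokens, so the paper's method must bridge the gap between continuous gradients and a discrete vocabulary; this sketch sidesteps that by keeping the prompt continuous throughout.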