Gradient-Based Language Model Red Teaming. (arXiv:2401.16656v1 [cs.CL])
cs.CL updates on arXiv.org
Red teaming is a common strategy for identifying weaknesses in generative
language models (LMs), in which adversarial prompts are crafted to trigger an
LM into generating unsafe responses. Red teaming is instrumental for both model
alignment and evaluation, but is labor-intensive and difficult to scale when
done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a
red teaming method for automatically generating diverse prompts that are likely
to cause an LM to output unsafe responses. GBRT is a …
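The core idea behind gradient-based approaches like GBRT can be sketched in miniature: relax the prompt to a continuous vector, score it with a frozen "unsafeness" scorer, and follow the gradient of that score to update the prompt. The toy below is an illustration under stated assumptions, not the paper's actual method: a single linear-plus-sigmoid scorer stands in for the full LM-plus-safety-classifier pipeline, and the gradients are computed by hand.

```python
import math

# Toy sketch of gradient-based red teaming: ascend the gradient of a
# frozen "unsafeness" scorer with respect to a soft (continuous) prompt.
# The linear scorer and all constants are illustrative stand-ins, not
# the models used in the GBRT paper.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def unsafe_score(prompt, weights):
    """Frozen stand-in for (LM + safety classifier): higher = less safe."""
    return sigmoid(sum(p * w for p, w in zip(prompt, weights)))

def gbrt_step(prompt, weights, lr=0.5):
    """One gradient-ascent step on the soft prompt."""
    s = unsafe_score(prompt, weights)
    # d score / d prompt_i = s * (1 - s) * w_i  (sigmoid chain rule)
    grad = [s * (1.0 - s) * w for w in weights]
    return [p + lr * g for p, g in zip(prompt, grad)]

weights = [0.8, -0.3, 0.5]   # frozen scorer parameters (illustrative)
prompt = [0.0, 0.0, 0.0]     # soft prompt initialised at zero
for _ in range(100):
    prompt = gbrt_step(prompt, weights)
print(round(unsafe_score(prompt, weights), 3))
```

In the real setting the prompt is discrete tokens, so the paper's method must bridge the gap between continuous gradients and a discrete vocabulary; this sketch sidesteps that by keeping the prompt continuous throughout.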