Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. (arXiv:2209.07858v2 [cs.CL] UPDATED)

Nov. 24, 2022, 7:18 a.m. | Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy

cs.CL updates on arXiv.org arxiv.org

We describe our early efforts to red team language models in order to
simultaneously discover, measure, and attempt to reduce their potentially
harmful outputs. We make three main contributions. First, we investigate
scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B
parameters) and 4 model types: a plain language model (LM); an LM prompted to
be helpful, honest, and harmless; an LM with rejection sampling; and a model
trained to be helpful and harmless using reinforcement …
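
As a rough illustration of the "LM with rejection sampling" model type named above, the sketch below draws several candidate replies and keeps the one that a scoring function rates as least harmful. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `rejection_sample`, `toy_generate`, `toy_harm_score`, and `CANNED_REPLIES`, the default of 16 candidates, and the keyword-counting scorer are all hypothetical stand-ins for a real sampler and a learned harmlessness scorer.

```python
import random
from typing import Callable, List


def rejection_sample(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    harm_score: Callable[[str, str], float],
    num_candidates: int = 16,  # arbitrary illustrative default, not from the source
    seed: int = 0,
) -> str:
    """Draw several candidate replies and return the one scored least harmful."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    rng = random.Random(seed)
    # Sample independent candidates from the (hypothetical) generator.
    candidates: List[str] = [generate(prompt, rng) for _ in range(num_candidates)]
    # Keep the candidate with the lowest harm score; ties go to the earliest draw.
    return min(candidates, key=lambda reply: harm_score(prompt, reply))


# Toy stand-ins so the sketch runs without a model (both are hypothetical).
CANNED_REPLIES = [
    "Here is an insult followed by another insult.",
    "Here is one insult.",
    "I would rather not help with that, but here is a safer alternative.",
]


def toy_generate(prompt: str, rng: random.Random) -> str:
    """Pretend language model: picks one canned reply at random."""
    return rng.choice(CANNED_REPLIES)


def toy_harm_score(prompt: str, reply: str) -> float:
    """Pretend harm scorer: counts occurrences of a flagged word."""
    return float(reply.lower().count("insult"))


if __name__ == "__main__":
    print(rejection_sample("Say something about my neighbor.", toy_generate, toy_harm_score))
```

In this sketch the filtering happens only at sampling time, so the underlying generator is unchanged; whether the returned reply is actually harmless depends entirely on the quality of the scorer and on at least one low-harm candidate being drawn.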


