March 25, 2024, 4:46 a.m. | Bahareh Harandizadeh, Abel Salinas, Fred Morstatter

cs.CL updates on arXiv.org

arXiv:2403.14988v1 Announce Type: new
Abstract: This paper explores the pressing issue of risk assessment in Large Language Models (LLMs) as they become increasingly prevalent in various applications. Focusing on how reward models, which are designed to fine-tune pretrained LLMs to align with human values, perceive and categorize different types of risks, we delve into the challenges posed by the subjective nature of preference-based training data. By utilizing the Anthropic Red-team dataset, we analyze major risk categories, including Information Hazards, Malicious …
