More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness
April 30, 2024, 4:50 a.m. | Aaron J. Li, Satyapriya Krishna, Himabindu Lakkaraju
cs.CL updates on arXiv.org
Abstract: The surge in Large Language Model (LLM) development has led to improved performance on cognitive tasks, as well as an urgent need to align these models with human values in order to safely harness their power. Despite the effectiveness of preference learning algorithms like Reinforcement Learning From Human Feedback (RLHF) in aligning models with human preferences, their assumed improvements to model trustworthiness haven't been thoroughly verified. To this end, this study investigates how models that have been …