Feb. 5, 2024, 6:59 p.m. | Allen Institute for AI

Source: Allen Institute for AI (www.youtube.com)

Abstract: Reward models are commonly used in the process of large language model alignment but are prone to reward hacking, where the true reward diverges from the estimated reward as the language model drifts out-of-distribution. In this talk, I will discuss a recent study on the use of reward ensembles to mitigate reward hacking. The study demonstrates that reward models that originate from different pretrain seeds are effective at mitigating reward hacking, but when errors of ensemble members correlate, the …
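The core idea behind reward ensembles is to score each (prompt, response) pair with several reward models (e.g., trained from different pretrain seeds) and combine the scores conservatively, so that responses on which the members disagree are not over-rewarded. The sketch below illustrates this under stated assumptions: each member is represented simply as a scalar score, and the aggregation names (mean, min, mean minus std) and example values are illustrative, not necessarily the talk's exact recipe.

```python
from statistics import mean, pstdev

def ensemble_reward(member_scores, aggregation="mean_minus_std", std_weight=1.0):
    """Conservatively combine the reward scores that ensemble members
    (e.g., reward models fine-tuned from different pretrain seeds)
    assign to one (prompt, response) pair."""
    if aggregation == "mean":
        return mean(member_scores)
    if aggregation == "min":              # worst-case member: most pessimistic
        return min(member_scores)
    if aggregation == "mean_minus_std":   # penalize disagreement among members
        return mean(member_scores) - std_weight * pstdev(member_scores)
    raise ValueError(f"unknown aggregation: {aggregation}")

# Illustrative scores for a response where one member disagrees, as might happen
# once the policy drifts out of distribution: conservative aggregations score it
# lower than the plain mean.
scores = [2.1, 1.8, -0.5]
print(ensemble_reward(scores, "mean"))            # ~1.13
print(ensemble_reward(scores, "min"))             # -0.5
print(ensemble_reward(scores, "mean_minus_std"))  # ~-0.03
```

Note that if the members' errors are correlated, they tend to agree even on over-optimized responses, so disagreement-based penalties like these provide less protection, which is the failure mode the abstract alludes to.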

