Feb. 5, 2024, 6:59 p.m. | Allen Institute for AI

Source: Allen Institute for AI (www.youtube.com)

Abstract: Reward models are commonly used in the process of large language model alignment but are prone to reward hacking, where the true reward diverges from the estimated reward as the language model drifts out-of-distribution. In this talk, I will discuss a recent study on the use of reward ensembles to mitigate reward hacking. The study demonstrates that reward models that originate from different pretrain seeds are effective at mitigating reward hacking, but when errors of ensemble members correlate, the …
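The core idea behind reward ensembles is to score each (prompt, response) pair with several reward models (e.g., trained from different pretrain seeds) and combine the scores conservatively, so that responses on which the members disagree are not over-rewarded. The sketch below illustrates this under stated assumptions: each member is represented simply as a scalar score, and the aggregation names (mean, min, mean minus std) and example values are illustrative, not necessarily the talk's exact recipe.

```python
from statistics import mean, pstdev

def ensemble_reward(member_scores, aggregation="mean_minus_std", std_weight=1.0):
    """Conservatively combine the reward scores that ensemble members
    (e.g., reward models fine-tuned from different pretrain seeds)
    assign to one (prompt, response) pair."""
    if aggregation == "mean":
        return mean(member_scores)
    if aggregation == "min":              # worst-case member: most pessimistic
        return min(member_scores)
    if aggregation == "mean_minus_std":   # penalize disagreement among members
        return mean(member_scores) - std_weight * pstdev(member_scores)
    raise ValueError(f"unknown aggregation: {aggregation}")

# Illustrative scores for a response where one member disagrees, as might happen
# once the policy drifts out of distribution: conservative aggregations score it
# lower than the plain mean.
scores = [2.1, 1.8, -0.5]
print(ensemble_reward(scores, "mean"))            # ~1.13
print(ensemble_reward(scores, "min"))             # -0.5
print(ensemble_reward(scores, "mean_minus_std"))  # ~-0.03
```

Note that if the members' errors are correlated, they tend to agree even on over-optimized responses, so disagreement-based penalties like these provide less protection, which is the failure mode the abstract alludes to.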

