Feb. 5, 2024, 6:59 p.m. | Allen Institute for AI | www.youtube.com

Abstract: Reward models are commonly used in large language model alignment but are prone to reward hacking, where the true reward diverges from the estimated reward as the language model drifts out of distribution. In this talk, I will discuss a recent study on using reward ensembles to mitigate reward hacking. The study demonstrates that ensembles of reward models originating from different pretrain seeds are effective at mitigating reward hacking, but when the errors of ensemble members correlate, the …
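The abstract's claim turns on how the members' reward estimates are aggregated. As a minimal sketch, assuming each ensemble member exposes a scalar score for a candidate response (the scores and aggregation modes below are hypothetical illustrations, not the study's implementation), averaging is optimistic, while a worst-case (min) aggregation hedges against members that over-estimate:

```python
# Minimal sketch of reward-ensemble aggregation. In practice each member
# would be a reward model trained from a different pretrain seed; here the
# per-member scores are hypothetical stand-ins.

from statistics import mean


def ensemble_reward(scores: list[float], mode: str = "mean") -> float:
    """Aggregate per-member reward estimates for one candidate response.

    "mean" averages the members; "min" takes the worst case, penalizing
    outputs that only some members rate highly. Neither helps if the
    members' errors are correlated, since all members then mis-score
    the same outputs together.
    """
    if mode == "mean":
        return mean(scores)
    if mode == "min":
        return min(scores)
    raise ValueError(f"unknown mode: {mode}")


# One member disagrees with the other two on this response.
member_scores = [0.82, 0.79, 0.31]

print(ensemble_reward(member_scores, "mean"))  # 0.64 -- optimistic
print(ensemble_reward(member_scores, "min"))   # 0.31 -- conservative
```

The min aggregation illustrates why seed diversity matters: it only provides a safety margin when at least one member's error is uncorrelated with the others'.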

