March 3, 2024, 1:57 p.m. | /u/zetiansss

r/MachineLearning (www.reddit.com)

I'm reading the DeepMind paper "WARM: On the Benefits of Weight Averaged Reward Models". The paper discusses the reward-hacking phenomenon.

In the paper, the authors use the KL-reward curve to detect reward hacking: when the reward starts to decrease as the KL divergence grows, they say reward hacking is occurring. However, previous papers like [https://arxiv.org/pdf/2312.09244.pdf](https://arxiv.org/pdf/2312.09244.pdf) often use two reward models to detect reward hacking: a proxy reward and a true reward. The policy model is updated under the proxy reward, so …
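The two-reward-model setup can be sketched in a few lines: reward hacking is typically flagged at the point where the proxy reward (which the policy is optimized against) keeps improving while the true reward starts to fall. The following is a minimal illustration on synthetic reward curves; the function name, smoothing window, and the synthetic data are all my own assumptions, not anything from the WARM paper.

```python
import numpy as np

def detect_reward_hacking(proxy_rewards, true_rewards, window=5):
    """Return the first (smoothed) training step where the proxy reward
    is still rising but the true reward has started to fall, i.e. the
    onset of reward over-optimization. Hypothetical helper, not from
    the paper."""
    kernel = np.ones(window) / window
    proxy = np.convolve(proxy_rewards, kernel, mode="valid")
    true = np.convolve(true_rewards, kernel, mode="valid")
    for t in range(1, len(proxy)):
        if proxy[t] > proxy[t - 1] and true[t] < true[t - 1]:
            return t  # index into the smoothed series
    return None  # no divergence detected

# Synthetic example: the proxy reward grows monotonically, while the
# true reward peaks mid-training and then degrades.
steps = np.arange(100)
proxy = 1 - np.exp(-steps / 30)            # keeps improving
true = np.exp(-((steps - 40) / 25) ** 2)   # peaks at step 40, then falls

print(detect_reward_hacking(proxy, true))
```

The KL-reward curve used in WARM is a single-model alternative to this: instead of comparing a proxy against a held-out true reward, it watches for the reward to drop as the policy drifts (in KL divergence) from its initialization.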

