Feb. 13, 2024, 5:42 a.m. | Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad

cs.LG updates on arXiv.org

In this work, we study the issue of reward hacking on response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response from an LLM can often deceive LLM-based or even human evaluators into assigning high scores. The same issue also affects some reward models in RL. To address the challenges in both training and evaluation, we establish a more reliable evaluation protocol for comparing different training configurations, …
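The abstract is cut off before the protocol details, but the failure mode it describes can be made concrete with a simple length-bias probe: if a reward model's scores correlate strongly with response length alone, the policy can be rewarded for verbosity rather than helpfulness. The sketch below is illustrative, not the paper's method; `reward_model`, `length_bias`, and the word-count length proxy are all assumptions introduced here.

```python
# Minimal length-bias probe for a reward model (illustrative sketch only).
# `reward_model` is a hypothetical callable (prompt, response) -> float score;
# it is NOT an API from the paper. A strongly positive correlation between
# score and response length suggests the reward can be hacked by verbosity.
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation, no external dependencies."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def length_bias(reward_model, prompts, responses):
    """Correlate reward scores with word counts over a fixed eval set."""
    scores = [reward_model(p, r) for p, r in zip(prompts, responses)]
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, scores)
```

Run on a held-out set of prompts with responses of varying length; a correlation near zero is what a length-robust reward model should produce, while a value close to 1 signals that length alone drives the score.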

Tags: cs.AI, cs.CL, cs.LG, RLHF, reinforcement learning, human feedback, reward hacking, LLMs
