March 11, 2024, 4:41 a.m. | Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu

cs.LG updates on arXiv.org

arXiv:2403.05171v1 Announce Type: new
Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward …
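To make the "lightweight uncertainty from last-layer embeddings" idea concrete, here is a minimal illustrative sketch, not the paper's exact AdvPO formulation: it estimates per-response reward uncertainty as the posterior predictive variance of a Bayesian (ridge) linear head fit on the reward model's last-layer features. The function name, hyperparameters (`lam`, `noise_var`), and the specific posterior-variance estimator are assumptions for illustration.

```python
import numpy as np

def last_layer_reward_uncertainty(phi_train, phi_query, lam=1.0, noise_var=1.0):
    """Illustrative sketch (hypothetical helper, not the paper's exact method):
    estimate reward uncertainty for query responses using only the reward
    model's last-layer embeddings, via the posterior variance of a
    Bayesian (ridge) linear head fit on the training embeddings.

    phi_train: (N, d) last-layer embeddings from reward-model training data
    phi_query: (M, d) last-layer embeddings of new responses to score
    lam:       prior precision (ridge regularizer) -- assumed hyperparameter
    noise_var: observation noise variance          -- assumed hyperparameter
    """
    d = phi_train.shape[1]
    # Posterior precision of the linear head: lam * I + (1/noise_var) * Phi^T Phi
    precision = lam * np.eye(d) + (phi_train.T @ phi_train) / noise_var
    cov = np.linalg.inv(precision)
    # Predictive variance for each query: phi^T Sigma phi
    return np.einsum('md,dk,mk->m', phi_query, cov, phi_query)
```

In a pipeline of this kind, such an uncertainty estimate would typically be used to penalize or down-weight high-uncertainty rewards during policy optimization, discouraging the policy from exploiting regions where the reward model is an unreliable proxy for human preference.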
