Overcoming Reward Overoptimization via Adversarial Policy Optimization with Lightweight Uncertainty Estimation
March 11, 2024, 4:41 a.m. | Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu
cs.LG updates on arXiv.org arxiv.org
Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preference, and RL-driven policy optimization erroneously exploits reward inaccuracies. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last layer embeddings of the reward …
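The abstract is truncated before the estimator is specified, so the following is only an illustrative sketch of one common lightweight approach in this family: treat the reward model's last-layer embedding φ(x) as a linear feature and score uncertainty as sqrt(φᵀΣ⁻¹φ), where Σ is a regularized covariance of training embeddings. All function names and the choice of estimator here are assumptions, not the paper's actual method.

```python
import numpy as np

def fit_embedding_covariance(embeddings: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Return the inverse regularized covariance of last-layer embeddings.

    embeddings: (n, d) array of last-layer features from the reward model
    lam: ridge regularizer, so the matrix is invertible even with few samples
    """
    d = embeddings.shape[1]
    cov = embeddings.T @ embeddings + lam * np.eye(d)
    return np.linalg.inv(cov)

def reward_uncertainty(phi: np.ndarray, cov_inv: np.ndarray) -> float:
    """Linear-bandit-style uncertainty: sqrt(phi^T Sigma^{-1} phi).

    Large values mean phi lies far from the span of training embeddings,
    i.e., the reward estimate there is less trustworthy.
    """
    return float(np.sqrt(phi @ cov_inv @ phi))
```

A policy optimizer could then penalize (or adversarially down-weight) rewards in high-uncertainty regions; note that only the d×d inverse covariance is stored, which is what makes the estimate lightweight relative to ensembles.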