March 11, 2024, 4:41 a.m. | Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, Yang Liu

cs.LG updates on arXiv.org

arXiv:2403.05171v1 Announce Type: new
Abstract: We introduce Adversarial Policy Optimization (AdvPO), a novel solution to the pervasive issue of reward over-optimization in Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Over-optimization occurs when a reward model serves as an imperfect proxy for human preferences, and RL-driven policy optimization erroneously exploits inaccuracies in the reward. In this paper, we begin by introducing a lightweight way to quantify uncertainties in rewards, relying solely on the last-layer embeddings of the reward …
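The truncated abstract indicates that the uncertainty estimate is built solely from last-layer embeddings of the reward model, but the paper's exact formulation is not shown here. The sketch below is a hypothetical illustration of one common lightweight approach of this kind: a ridge-regularized (Laplace-style) quadratic form over the reward head's features, used to discount rewards for responses far from the reward model's training distribution. All function names, the ridge term, and the pessimistic-reward usage are illustrative assumptions, not AdvPO itself.

    import torch

    def fit_embedding_precision(embeddings: torch.Tensor, ridge: float = 1e-3) -> torch.Tensor:
        """Fit a regularized precision matrix from last-layer embeddings.

        embeddings: (N, d) matrix of last-layer features collected on the
        reward model's preference-training data.
        Returns the d x d inverse of (Phi^T Phi + ridge * I).
        """
        d = embeddings.shape[1]
        gram = embeddings.T @ embeddings + ridge * torch.eye(d)
        return torch.linalg.inv(gram)

    def reward_uncertainty(phi: torch.Tensor, precision: torch.Tensor) -> torch.Tensor:
        """Per-sample uncertainty as sqrt(phi^T (Phi^T Phi + ridge I)^{-1} phi).

        phi: (B, d) last-layer embeddings of candidate responses; the value is
        larger for embeddings far from the reward model's training data.
        """
        return torch.sqrt(torch.einsum("bd,de,be->b", phi, precision, phi))

    # Illustrative usage during policy optimization: penalize the proxy reward
    # by its uncertainty before the RL update (a pessimistic reward).
    # reward_pessimistic = reward_mean - beta * reward_uncertainty(phi, precision)

Because only the final linear head is treated probabilistically, this kind of estimate needs no extra forward passes or ensembles, which is consistent with the abstract's claim of a lightweight method; how AdvPO actually uses the uncertainty is described in the full paper.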

