Don't Forget Your Reward Values: Language Model Alignment via Value-based Calibration
Feb. 27, 2024, 5:49 a.m. | Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Anh Tuan Luu
cs.CL updates on arXiv.org arxiv.org
Abstract: While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns regarding the complexity and instability associated with the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based calibration methods as viable alternatives. This paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. Building upon these findings, we propose a novel Value-based CaliBration (VCB) …
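To make the distinction concrete: an "order-based" calibration loss (DPO-style) uses only which of two responses the reward model ranked higher, discarding the magnitude of the reward gap that the abstract says such methods underutilize. The sketch below is purely illustrative, using a hypothetical reward-gap weighting; the abstract is truncated, so this is not the paper's actual VCB objective.

```python
import math

def order_based_loss(logp_w, logp_l, beta=0.1):
    """DPO-style pairwise loss: depends only on the log-prob margin
    between the preferred (w) and dispreferred (l) responses, i.e. on
    their ordering, not on how large the reward gap was."""
    margin = beta * (logp_w - logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def value_weighted_loss(logp_w, logp_l, r_w, r_l, beta=0.1):
    """Hypothetical value-aware variant (illustration only, not VCB):
    scale the same pairwise loss by the reward gap r_w - r_l, so pairs
    the reward model separates strongly push the policy harder."""
    return (r_w - r_l) * order_based_loss(logp_w, logp_l, beta)

# Two pairs with identical ordering but different reward gaps contribute
# identically to the order-based loss, yet differently once values are used.
small_gap = value_weighted_loss(-1.0, -1.2, r_w=0.6, r_l=0.5)
large_gap = value_weighted_loss(-1.0, -1.2, r_w=0.9, r_l=0.1)
print(large_gap > small_gap)  # True
```

The point of the toy weighting is only to show where reward values could enter the objective; the actual VCB formulation is in the full paper.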