Feb. 27, 2024, 5:49 a.m. | Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Anh Tuan Luu

cs.CL updates on arXiv.org

arXiv:2402.16030v1 Announce Type: new
Abstract: While Reinforcement Learning from Human Feedback (RLHF) significantly enhances the generation quality of Large Language Models (LLMs), recent studies have raised concerns regarding the complexity and instability associated with the Proximal Policy Optimization (PPO) algorithm, proposing a series of order-based calibration methods as viable alternatives. This paper delves further into current order-based methods, examining their inefficiencies in utilizing reward values and addressing misalignment issues. Building upon these findings, we propose a novel Value-based CaliBration (VCB) …
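The truncated abstract does not spell out the VCB objective, so the following PyTorch snippet is only a rough, non-authoritative sketch of the distinction the authors draw: an order-based loss uses just the ranking of a preference pair, while a value-aware variant also uses the reward magnitudes. The function names, the margin scheme, and the toy tensors below are illustrative assumptions, not the paper's method.

# Illustrative sketch only: contrasts a rank-only (order-based) preference loss
# with a hypothetical value-weighted variant that also uses reward magnitudes.
# Names and the margin scheme are assumptions, not the VCB objective from the paper.
import torch
import torch.nn.functional as F

def order_based_loss(logp_chosen, logp_rejected):
    # Rank-only calibration: only which response is preferred matters.
    return -F.logsigmoid(logp_chosen - logp_rejected).mean()

def value_weighted_loss(logp_chosen, logp_rejected, reward_chosen, reward_rejected):
    # Hypothetical value-aware variant: the reward gap sets a target margin,
    # so pairs with a larger reward difference are pushed apart more strongly.
    margin = (reward_chosen - reward_rejected).detach()
    return -F.logsigmoid(logp_chosen - logp_rejected - margin).mean()

# Toy usage: two preference pairs with different reward gaps.
logp_c = torch.tensor([-1.0, -0.5])
logp_r = torch.tensor([-1.2, -0.6])
r_c = torch.tensor([0.9, 0.6])
r_r = torch.tensor([0.1, 0.5])
print(order_based_loss(logp_c, logp_r).item(),
      value_weighted_loss(logp_c, logp_r, r_c, r_r).item())

In this toy run, the pair with the wider reward gap contributes a larger penalty under the value-weighted loss even though both pairs have the same preference ordering, which is the kind of reward-value information the abstract says order-based methods leave unused.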
