Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences
March 5, 2024, 2:42 p.m. | Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Georgios Tzannetos, Goran Radanović, Adish Singla
cs.LG updates on arXiv.org (arxiv.org)
Abstract: In this paper, we take a step towards a deeper understanding of learning from human preferences by systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct preference optimization (DPO). We focus our attention on the class of loglinear policy parametrization and linear reward functions. In order to compare the two paradigms, we first derive minimax statistical bounds on the suboptimality gap induced by both RLHF and …
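For context, a minimal sketch of the setting the abstract describes, under standard assumptions rather than notation taken from the paper: a loglinear policy class and linear reward functions are conventionally written with feature maps φ, ψ and parameter vectors θ, w, and the suboptimality gap of a learned policy is measured against the optimal policy under the true reward:

% Sketch of the standard definitions the abstract refers to; the feature
% maps \phi, \psi, parameters \theta, w, and the gap definition below are
% assumptions for illustration, not notation confirmed by the truncated abstract.
\[
\pi_\theta(a \mid s) \;=\; \frac{\exp\!\big(\theta^\top \phi(s,a)\big)}{\sum_{a'} \exp\!\big(\theta^\top \phi(s,a')\big)}
\qquad \text{(loglinear policy)}
\]
\[
r_w(s,a) \;=\; w^\top \psi(s,a)
\qquad \text{(linear reward)}
\]
\[
\mathrm{SubOpt}(\hat{\pi}) \;=\; J_{r^\ast}(\pi^\ast) - J_{r^\ast}(\hat{\pi}),
\qquad
J_r(\pi) \;=\; \mathbb{E}_{\pi}\!\Big[\textstyle\sum_t r(s_t, a_t)\Big]
\qquad \text{(suboptimality gap)}
\]

The paper's minimax bounds, as the abstract states, characterize how large this gap can be for RLHF (which first fits a reward model from preferences) versus DPO (which optimizes the policy directly on preference data) within this parametric class.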