April 19, 2024, 4:42 a.m. | Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

cs.LG updates on arXiv.org

arXiv:2404.12358v1 Announce Type: new
Abstract: Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complexity of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches: standard RLHF deploys reinforcement learning in a specific token-level …
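
To make the objective the abstract refers to concrete, here is a minimal sketch of the sequence-level DPO loss. It assumes precomputed summed token log-probabilities for chosen and rejected completions under the policy and a frozen reference model; the function and argument names are illustrative, not from the paper.

import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Sequence-level DPO loss (illustrative sketch).

    Each argument is a tensor of summed token log-probabilities for a
    batch of (chosen, rejected) completions under the policy or the
    frozen reference model. `beta` scales the implicit reward.
    """
    # Implicit rewards: scaled log-ratio of policy to reference model.
    chosen_rewards = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_rewards = beta * (policy_logp_rejected - ref_logp_rejected)

    # Bradley-Terry preference likelihood, maximized via logistic loss.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

The mismatch the abstract points to is that this loss is defined over whole sequences, whereas standard RLHF optimizes a token-level formulation.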

