DPO Meets PPO: Reinforced Token Optimization for RLHF

April 30, 2024, 4:42 a.m. | Han Zhong, Guhao Feng, Wei Xiong, Li Zhao, Di He, Jiang Bian, Liwei Wang

cs.LG updates on arXiv.org

arXiv:2404.18922v1 Announce Type: new
Abstract: In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a …
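
To make the "sparse, sentence-level rewards" setting mentioned in the abstract concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the paper's method: the helper names `sentence_level_rewards` and `discounted_returns` are hypothetical, and the sketch assumes the common convention that a single scalar score for the whole response is credited to the final token while every other token receives zero reward.

```python
from typing import List


def sentence_level_rewards(num_tokens: int, sentence_reward: float) -> List[float]:
    """Build a per-token reward list for one response.

    Assumption (illustrative): the single sentence-level score is
    placed on the last token; all earlier tokens receive 0.0.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    rewards = [0.0] * num_tokens
    rewards[-1] = sentence_reward
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Compute the return-to-go at each token position."""
    returns: List[float] = []
    running = 0.0
    # Walk backwards so each position accumulates all later rewards.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    rewards = sentence_level_rewards(num_tokens=4, sentence_reward=0.7)
    print(rewards)
    print(discounted_returns(rewards))
```

With no discounting, every token position sees the same return-to-go, so the reward signal alone does not indicate which tokens contributed to the score. This is one way to read why the abstract calls the sparse setting challenging.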
