RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models
Feb. 16, 2024, 5:43 a.m. | Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, Prathap Ramachandra
Source: cs.LG updates on arXiv.org
Abstract: Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requiring significant hyperparameter finetuning, and is computationally expensive when maximizing the estimated reward during alignment. Recently, direct preference optimization (DPO) was proposed to address those challenges. However, DPO relies on contrastive responses generated by human annotators and an alternative LLM, instead of the policy model, limiting the effectiveness of …
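For readers unfamiliar with DPO, the sketch below shows the standard DPO objective for a single preference pair, as it is commonly written. It is not the RS-DPO method itself, whose details are cut off in the abstract above. The function names, the argument names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    All names and the default beta are hypothetical, chosen for illustration.
    """
    # Log-ratio of policy to reference for each response
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Scaled difference of the two log-ratios
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) equals softplus(-logits)
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: no preference is expressed yet
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy favors the chosen response more than the reference does
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy matches the reference, the loss equals log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one. The pair itself comes from outside this function, which is where the abstract's concern about the source of contrastive responses applies.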