Feb. 16, 2024, 5:43 a.m. | Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, Prathap Ramachandra

cs.LG updates on arXiv.org

arXiv:2402.10038v1 Announce Type: cross
Abstract: Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. However, proximal policy optimization (PPO) based RLHF is occasionally unstable, requiring significant hyperparameter fine-tuning, and is computationally expensive when maximizing the estimated reward during alignment. Recently, direct preference optimization (DPO) was proposed to address those challenges. However, DPO relies on contrastive responses generated by human annotators and an alternative LLM, rather than the policy model, limiting the effectiveness of …
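For context on the objective the abstract contrasts with PPO-based RLHF, here is a minimal sketch of the standard DPO loss (not the paper's proposed method). It assumes summed per-response log-probabilities under the trained policy and a frozen reference model; the tensor names and `beta` default are illustrative.

```python
# Minimal sketch of the standard DPO objective, assuming per-response
# log-probabilities have already been summed over tokens.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is the log-probability of a response under the policy
    or the frozen reference model; `beta` scales the implicit KL penalty.
    """
    # Log-ratios of policy vs. reference for preferred / dispreferred responses
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred response's implicit reward above the dispreferred one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that both response log-probabilities come from a fixed preference dataset rather than fresh samples from the policy, which is the limitation the abstract highlights.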

