Dataset Reset Policy Optimization for RLHF
April 15, 2024, 4:42 a.m. | Jonathan D. Chang, Wenhao Shan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
cs.LG updates on arXiv.org
Abstract: Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, and has produced impressive models such as GPT-4 and Claude 3 Opus. This framework typically consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of resets, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that …
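The abstract describes the standard two-step RLHF recipe, plus the idea of using resets during the online phase. As a rough illustration only, the sketch below shows those two steps on a toy tabular problem: fitting a reward model from pairwise preferences with a Bradley-Terry-style loss, then running a REINFORCE-style policy update against the learned reward while starting rollouts from states drawn from the offline dataset (one plain reading of "dataset reset"). The environment, update rules, and all names here are illustrative assumptions, not the paper's algorithm or its guarantees.

```python
# Toy two-step RLHF sketch (illustrative assumptions throughout, not the paper's method).
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 5, 3

# Hidden "human" preference used only to generate the offline dataset.
true_reward = rng.normal(size=(N_STATES, N_ACTIONS))

def sample_pref(s):
    # Return (state, preferred_action, rejected_action), labeled by the hidden reward.
    a, b = rng.permutation(N_ACTIONS)[:2]
    return (s, a, b) if true_reward[s, a] > true_reward[s, b] else (s, b, a)

offline_prefs = [sample_pref(rng.integers(N_STATES)) for _ in range(300)]

# --- Step 1: learn a reward model from the offline preference dataset ---
reward_table = np.zeros((N_STATES, N_ACTIONS))  # learned reward model r(s, a)
lr = 0.1
for _ in range(50):
    for s, a_pos, a_neg in offline_prefs:
        margin = reward_table[s, a_pos] - reward_table[s, a_neg]
        # Gradient ascent on log sigmoid(margin) (Bradley-Terry likelihood).
        grad = 1.0 - 1.0 / (1.0 + np.exp(-margin))
        reward_table[s, a_pos] += lr * grad
        reward_table[s, a_neg] -= lr * grad

# --- Step 2: online RL against the learned reward, with dataset resets ---
logits = np.zeros((N_STATES, N_ACTIONS))              # tabular softmax policy
dataset_states = np.array([s for s, _, _ in offline_prefs])
for _ in range(2000):
    s = rng.choice(dataset_states)                    # reset to a state from the offline data
    probs = np.exp(logits[s]) / np.exp(logits[s]).sum()
    a = rng.choice(N_ACTIONS, p=probs)
    r = reward_table[s, a]                            # reward comes from the learned model
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                             # d/d logits of log pi(a|s)
    logits[s] += 0.05 * r * grad_log_pi               # REINFORCE-style update

print("reward-model greedy actions:", reward_table.argmax(1))
print("policy greedy actions:      ", logits.argmax(1))
```

In this toy setting the two printed rows typically agree, which is all the sketch is meant to show: the online policy ends up optimizing whatever the offline-learned reward model encodes, and the rollout start states are drawn from the preference dataset rather than a fixed initial state.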
Tags: arxiv, cs.AI, cs.CL, cs.LG, dataset, optimization, policy, RLHF