April 15, 2024, 4:42 a.m. | Jonathan D. Chang, Wenhao Shan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun

cs.LG updates on arXiv.org

arXiv:2404.08495v1 Announce Type: new
Abstract: Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that …

