Feb. 6, 2024, 5:43 a.m. | Gaurav Pandey, Yatin Nandwani, Tahira Naseem, Mayank Mishra, Guangxuan Xu, Dinesh Raghu, Sachindra Joshi

cs.LG updates on arXiv.org

Following the success of Proximal Policy Optimization (PPO) for Reinforcement Learning from Human Feedback (RLHF), new techniques such as Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) have been proposed that are offline in nature and use rewards only indirectly. These techniques, in particular DPO, have recently become the tools of choice for LLM alignment due to their scalability and performance. However, they leave behind important features of the PPO approach. Methods such as SLiC or RRHF …
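For context on what "use rewards only indirectly" means here: PPO-style RLHF optimizes against an explicit reward model online, whereas DPO folds the reward into a preference loss on the policy itself, with the implicit reward being the log-ratio of policy to reference probabilities. Below is a minimal sketch of the standard DPO objective (Rafailov et al., 2023), not the method proposed in this paper; the function and argument names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities of the
    chosen (preferred) or rejected response under the trainable policy
    or the frozen reference model. The reward never appears explicitly:
    it is implied by the beta-scaled policy/reference log-ratio.
    """
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the implicit reward margin between chosen and rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```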
