Oct. 16, 2023, 9 a.m.

The Berkeley Artificial Intelligence Research Blog bair.berkeley.edu

Rethinking the Role of PPO in RLHF

TL;DR: In RLHF, there is a tension between the reward learning phase, which uses human preferences in the form of comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparative reward. What if we performed RL in a comparative way?
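
To make that tension concrete, the two phases are commonly written roughly as follows (notation here is illustrative rather than quoted from the post): the reward model is fit only to reward differences through a Bradley-Terry style comparison loss, while the RL phase then optimizes the absolute value of that same reward.

```latex
% Phase 1: reward learning from pairwise comparisons (Bradley-Terry style).
% Only the difference r_phi(x, y_w) - r_phi(x, y_l) appears, so the learned
% reward is determined only up to a prompt-dependent shift.
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

% Phase 2: RL fine-tuning (typically with PPO) of the absolute reward,
% regularized toward a reference policy. This phase consumes absolute reward
% values rather than comparisons, which is the source of the mismatch.
\max_{\theta}\;
  \mathbb{E}_{x\sim\mathcal{D},\;y\sim\pi_\theta(\cdot\mid x)}
  \left[r_\phi(x, y)\right]
  \;-\;\beta\,
  \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)
```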

Figure 1: This diagram illustrates the difference between reinforcement learning from absolute feedback and relative feedback. By incorporating a new component, the pairwise policy gradient, we can unify the reward modeling stage and …
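
As a rough sketch of what "RL in a comparative way" could look like, the snippet below implements a minimal pairwise policy-gradient style loss in PyTorch: two responses are sampled for the same prompt, and the update is weighted by their reward difference, so any prompt-dependent shift in the reward cancels out. The function and variable names (pairwise_pg_loss, logp_1, and so on) are hypothetical, and the sketch leaves out the clipping and variance-reduction machinery a practical method would add.

```python
import torch


def pairwise_pg_loss(logp_1, logp_2, reward_1, reward_2):
    """Pairwise (comparative) policy-gradient surrogate loss.

    logp_1, logp_2     : summed log-probabilities of two sampled responses to
                         the same prompt under the current policy, shape (batch,).
    reward_1, reward_2 : scalar rewards for those responses, shape (batch,).
    """
    # Weight the update by the reward *difference*; no gradient flows through it.
    diff = (reward_1 - reward_2).detach()
    # REINFORCE-style surrogate: increase the log-probability gap in favor of
    # whichever response scored higher, scaled by how much higher it scored.
    surrogate = 0.5 * diff * (logp_1 - logp_2)
    # Return a loss to minimize (negative of the quantity we want to maximize).
    return -surrogate.mean()


if __name__ == "__main__":
    # Toy check: adding the same constant to both rewards (a prompt-dependent
    # shift) leaves the pairwise loss, and hence its gradient, unchanged.
    torch.manual_seed(0)
    logp_1 = torch.randn(4, requires_grad=True)
    logp_2 = torch.randn(4, requires_grad=True)
    r1, r2 = torch.randn(4), torch.randn(4)
    print(torch.allclose(pairwise_pg_loss(logp_1, logp_2, r1, r2),
                         pairwise_pg_loss(logp_1, logp_2, r1 + 3.0, r2 + 3.0)))  # True
```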
