Rethinking the Role of PPO in RLHF
Oct. 16, 2023, 9 a.m. | The Berkeley Artificial Intelligence Research Blog (bair.berkeley.edu)
TL;DR: In RLHF, there is a tension between the reward learning phase, which uses human preferences in the form of pairwise comparisons, and the RL fine-tuning phase, which optimizes a single, non-comparative reward. What if we performed RL in a comparative way?
Figure 1: This diagram illustrates the difference between reinforcement learning from absolute feedback and relative feedback. By incorporating a new component, the pairwise policy gradient, we can unify the reward modeling stage and …
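To make the comparative idea concrete, here is a minimal numpy sketch of a pairwise policy-gradient update on a toy bandit. Everything in it (the three "completions", their latent rewards, the learning rate) is an illustrative assumption, not the post's actual implementation; it only shows the core mechanic of sampling a pair and updating on the reward *difference*.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy bandit (hypothetical setup): 3 "completions" with latent scalar
# rewards; the policy is a softmax over logits, a stand-in for an LM policy.
true_rewards = np.array([0.0, 1.0, 3.0])
logits = np.zeros(3)
lr = 0.1

for _ in range(2000):
    p = softmax(logits)
    # Sample a *pair* of actions, mirroring comparison-based feedback.
    a1, a2 = rng.choice(3, size=2, p=p)
    # Relative advantage: only the reward difference matters, so the
    # update is invariant to any constant shift of the reward, matching
    # the comparative reward-modeling stage.
    adv = true_rewards[a1] - true_rewards[a2]
    # grad log pi(a) = onehot(a) - p for a softmax policy; in the
    # difference grad log pi(a1) - grad log pi(a2), the -p terms cancel.
    grad = np.eye(3)[a1] - np.eye(3)[a2]
    logits += lr * adv * grad / 2

print("final policy:", softmax(logits))
```

Because the update depends only on reward differences, the policy concentrates on the highest-reward completion regardless of any offset added to all rewards, which is the property that lets relative-feedback RL line up with a comparison-trained reward model.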
More from bair.berkeley.edu / The Berkeley Artificial Intelligence Research Blog
Modeling Extremely Large Images with xT
1 month, 1 week ago |
bair.berkeley.edu
2024 BAIR Graduate Directory
1 month, 3 weeks ago |
bair.berkeley.edu
The Shift from Models to Compound AI Systems
2 months, 2 weeks ago |
bair.berkeley.edu
Ghostbuster: Detecting Text Ghostwritten by Large Language Models
5 months, 2 weeks ago |
bair.berkeley.edu
Asymmetric Certified Robustness via Feature-Convex Neural Networks
5 months, 2 weeks ago |
bair.berkeley.edu