Feb. 6, 2024, 5:48 a.m. | Luke Marks Amir Abdullah Luna Mendez Rauno Arike Philip Torr Fazl Barez

cs.LG updates on arXiv.org arxiv.org

Large language models (LLMs) fine-tuned by reinforcement learning from human feedback (RLHF) are becoming more widely deployed. We coin the term $\textit{Implicit Reward Model}$ (IRM) to refer to the changes that occur to an LLM during RLHF that result in high-reward generations. We interpret IRMs, and measure their divergence from the RLHF reward model used in the fine-tuning process that induced them. By fitting a linear function to an LLM's IRM, a reward model with the same type signature as …

beyond cs.lg divergence feedback human human feedback language language models large language large language models llm llms reinforcement reinforcement learning reward model rlhf training

Founding AI Engineer, Agents

@ Occam AI | New York

AI Engineer Intern, Agents

@ Occam AI | US

AI Research Scientist

@ Vara | Berlin, Germany and Remote

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne