Direct Preference Optimization with an Offset
Feb. 19, 2024, 5:42 a.m. | Afra Amini, Tim Vieira, Ryan Cotterell
cs.LG updates on arXiv.org
Abstract: Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal: while in some cases the preferred response is only slightly better than the …
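The abstract is cut off before it defines the offset, but the vanilla DPO objective it describes fits in a few lines. Below is a minimal PyTorch sketch, assuming each argument is a tensor of summed per-token log-probabilities of a full response under the policy or the frozen reference model; the `offset` parameter is a hypothetical illustration of where an additive margin, as the title suggests, could enter (zero recovers standard DPO), not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, offset=0.0):
    # Log-ratios of policy vs. reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO maximizes log sigmoid(beta * margin); a positive
    # `offset` (hypothetical here) would demand a larger likelihood
    # margin for the preferred response before the loss saturates.
    margin = beta * (chosen_logratio - rejected_logratio) - offset
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
pol_w = torch.tensor([-12.3, -8.1])   # preferred responses, policy
pol_l = torch.tensor([-14.0, -9.5])   # dispreferred responses, policy
ref_w = torch.tensor([-12.5, -8.4])   # preferred responses, reference
ref_l = torch.tensor([-13.2, -9.1])   # dispreferred responses, reference
print(dpo_loss(pol_w, pol_l, ref_w, ref_l))
```

Minimizing this loss raises the policy's likelihood of the preferred response relative to the dispreferred one, regularized toward the reference model through the log-ratios; no reward model or reinforcement-learning loop is needed, which is exactly the appeal the abstract highlights.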