Jan. 16, 2024, 9:05 a.m. | /u/erap129


So this is the loss for Direct Preference Optimization (DPO):



$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

I don't understand the division by $\pi_{\mathrm{ref}}$ (both for $y_w$ and for $y_l$). I know the purpose is to keep the finetuned model from straying too far from the reference model, but just looking at it mathematically: why should $\pi_{\mathrm{ref}}(y_w \mid x)$ be close to $\pi_\theta(y_w \mid x)$?

At least for $y_w$, it seems like the loss would benefit from $\pi_{\mathrm{ref}}(y_w \mid x)$ being as close as possible to 0, because we want to …
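For concreteness, here is a minimal PyTorch sketch of this loss, assuming the per-sequence log-probabilities $\log \pi(y \mid x)$ (summed over response tokens) have already been computed; `dpo_loss` and its argument names are hypothetical, not from the paper. In log space the division by $\pi_{\mathrm{ref}}$ is just a subtraction:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from per-sequence log-probabilities.

    Each argument is log pi(y|x), summed over response tokens:
    policy_* come from the model being finetuned (pi_theta),
    ref_* from the frozen reference model (pi_ref).
    """
    # Division by pi_ref in the loss becomes subtraction in log space:
    # log(pi_theta(y|x) / pi_ref(y|x)) = log pi_theta(y|x) - log pi_ref(y|x)
    chosen_logratio = policy_logp_w - ref_logp_w
    rejected_logratio = policy_logp_l - ref_logp_l
    # -log sigmoid(beta * (chosen - rejected)), averaged over the batch;
    # logsigmoid is the numerically stable form of log(sigmoid(.))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of 2 preference pairs with made-up log-probs
policy_logp_w = torch.tensor([-12.3, -8.1])  # log pi_theta(y_w | x)
policy_logp_l = torch.tensor([-11.9, -9.4])  # log pi_theta(y_l | x)
ref_logp_w = torch.tensor([-13.0, -8.5])     # log pi_ref(y_w | x)
ref_logp_l = torch.tensor([-11.5, -9.0])     # log pi_ref(y_l | x)
print(dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l))
```

One thing the sketch makes visible: because $\pi_{\mathrm{ref}}$ is frozen, `ref_logp_w` and `ref_logp_l` carry no gradient, so the division by $\pi_{\mathrm{ref}}$ only shifts each example's margin inside the sigmoid rather than directly pushing $\pi_\theta(y_w \mid x)$ toward $\pi_{\mathrm{ref}}(y_w \mid x)$.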

