Sept. 6, 2023, 4:13 p.m. | /u/Chen806

Machine Learning www.reddit.com

This may be a basic question for some, but it has bothered me for a while. For InstructGPT or other instruction-following models with alignment, RLHF seems to be the standard: collect human feedback, train a reward model on it, and then use RL to further fine-tune the model. However, why not use the human feedback directly to fine-tune the model with a simple ranking loss (e.g., a pairwise loss)? See the sketch below for what I mean.
What would be the main advantage of RLHF?
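
For concreteness, here is a rough sketch of the kind of pairwise loss I have in mind (PyTorch, all names hypothetical), applied directly to the model's sequence log-probabilities on a preferred vs. rejected response pair, instead of going through a separate reward model and an RL step:

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(logprob_chosen: torch.Tensor,
                          logprob_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: push the policy to assign a higher
    sequence log-probability to the human-preferred response than to the
    rejected one. Both inputs are summed token log-probs of shape (batch,)."""
    # -log sigmoid(chosen - rejected) goes to zero as the margin grows
    return -F.logsigmoid(logprob_chosen - logprob_rejected).mean()

# Hypothetical usage: score both responses of each preference pair with the
# language model being fine-tuned, then backprop through this loss.
# loss = pairwise_ranking_loss(policy_logprobs_chosen, policy_logprobs_rejected)
```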

