Feb. 9, 2024, 5:43 a.m. | Zhiheng Xi, Wenxiang Chen, Boyang Hong, Senjie Jin, Rui Zheng, Wei He, Yiwen Ding, Shichun Liu

cs.LG updates on arXiv.org

In this paper, we propose R$^3$: Learning Reasoning through Reverse Curriculum Reinforcement Learning (RL), a novel method that employs only outcome supervision to achieve the benefits of process supervision for large language models. The core challenge in applying RL to complex reasoning is to identify a sequence of actions that result in positive rewards and provide appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires …
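The recipe described in the abstract is a reverse curriculum: rollouts start from a point late in a correct demonstration, where the sparse outcome reward is easy to earn, and the start point slides earlier as training progresses. Below is a minimal sketch of that idea in Python, assuming toy stand-ins for the policy and reward; the function names and reward here are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the reverse-curriculum idea (illustrative only; the
# function names and reward are assumptions, not the paper's implementation).
# The policy completes a reasoning chain starting from a prefix of a correct
# demonstration. Early stages start near the answer, so the outcome reward is
# easy to reach; the final stage provides no demonstration prefix at all.

from typing import Callable, List

def reverse_curriculum_stages(demo_steps: List[str]) -> List[List[str]]:
    """Yield progressively shorter demonstration prefixes, longest first.

    Stage 0 hands the model almost the whole correct reasoning chain;
    the final stage hands it nothing and matches standard outcome-only RL.
    """
    n = len(demo_steps)
    return [demo_steps[:k] for k in range(n - 1, -1, -1)]

def train_stage(policy_step: Callable[[List[str]], str],
                outcome_reward: Callable[[str], float],
                prefix: List[str]) -> float:
    """One rollout from a demonstration prefix, scored only on the final answer."""
    answer = policy_step(prefix)   # the policy completes the reasoning
    return outcome_reward(answer)  # sparse, outcome-only supervision

# Usage with toy stand-ins for the policy and the outcome reward:
demo = ["step 1: parse the problem", "step 2: set up the equation", "answer: 42"]
for stage, prefix in enumerate(reverse_curriculum_stages(demo)):
    r = train_stage(lambda p: "answer: 42",
                    lambda a: float(a == "answer: 42"),
                    prefix)
    print(f"stage {stage}: start after {len(prefix)} demo steps, reward={r}")
```

Because the earliest stages begin only a step or two before the answer, a correct/incorrect outcome reward effectively localizes errors to those last steps, which is how outcome supervision recovers much of the benefit of step-wise process supervision without a process reward model.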
