Feb. 29, 2024, 5:48 a.m. | Shuo Yang, Gjergji Kasneci

cs.CL updates on arXiv.org

arXiv:2402.18284v1 Announce Type: new
Abstract: The wide usage of ChatGPT has highlighted the potential of reinforcement learning from human feedback. However, its training pipeline relies on manual ranking, a resource-intensive process. To reduce labor costs, we propose a self-supervised text-ranking approach for applying Proximal Policy Optimization (PPO) to fine-tune language models while eliminating the need for human annotators. Our method begins with probabilistic sampling to encourage a language model to generate diverse responses for each input. We then employ the TextRank and ISODATA algorithms …
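The two steps the abstract does spell out — probabilistic sampling of diverse candidates, then label-free graph ranking — can be illustrated with a small sketch. The snippet below is an assumption-laden illustration, not the authors' code: the model ("gpt2"), the sampling hyperparameters, and the plain power-iteration PageRank used as a stand-in for TextRank are all hypothetical choices, and the paper's ISODATA clustering step is omitted because the abstract is truncated before describing its role.

```python
# Minimal, self-contained sketch of "sample diverse responses, then rank them
# without human labels". Assumptions: gpt2 as the policy model, temperature/
# top-p sampling, and term-frequency cosine similarity + PageRank as a simple
# proxy for TextRank.
from collections import Counter

import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer


def sample_candidates(prompt, n=8, max_new_tokens=64):
    """Step 1: probabilistic sampling to get diverse responses per input."""
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tok(prompt, return_tensors="pt")
    outs = model.generate(
        **inputs,
        do_sample=True,              # stochastic decoding instead of greedy
        temperature=1.0,
        top_p=0.95,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
        pad_token_id=tok.eos_token_id,
    )
    return [tok.decode(o, skip_special_tokens=True) for o in outs]


def rank_candidates(texts, damping=0.85, iters=50):
    """Step 2: graph-based ranking. Builds a cosine-similarity graph over
    term-frequency vectors and runs PageRank (the core of TextRank)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w, c in Counter(t.lower().split()).items():
            X[row, idx[w]] = c
    X /= np.linalg.norm(X, axis=1, keepdims=True) + 1e-12  # unit-length rows
    sim = X @ X.T                                          # cosine similarity
    np.fill_diagonal(sim, 0.0)
    col = sim.sum(axis=0, keepdims=True)
    col[col == 0] = 1.0
    M = sim / col                                          # column-stochastic
    r = np.full(len(texts), 1.0 / len(texts))
    for _ in range(iters):                                 # power iteration
        r = (1 - damping) / len(texts) + damping * (M @ r)
    return np.argsort(-r)                                  # best-first indices


candidates = sample_candidates("Explain reinforcement learning in one sentence.")
for rank, i in enumerate(rank_candidates(candidates), 1):
    print(rank, candidates[i][:80])
```

In the paper's pipeline, a ranking produced this way would replace the manual rankings that RLHF normally feeds to PPO, which is what lets the fine-tuning proceed without human annotators.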
