Removing RLHF Protections in GPT-4 via Fine-Tuning
April 9, 2024, 4:51 a.m. | Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang
cs.CL updates on arXiv.org arxiv.org
Abstract: As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning …
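The abstract refers to vendor-hosted fine-tuning of models such as GPT-4. As a hedged illustration only (the paper's actual training data and method are not shown in this excerpt), vendor fine-tuning endpoints in the OpenAI style accept training data as JSONL, one chat example per line; the snippet below sketches building such a file, with all prompts and filenames hypothetical:

```python
import json

# Hedged sketch: serialize chat examples into the JSONL format used by
# OpenAI-style fine-tuning endpoints (an assumption for illustration;
# the example contents here are placeholders, not the paper's data).
examples = [
    {"messages": [
        {"role": "user", "content": "Example prompt"},
        {"role": "assistant", "content": "Example completion"},
    ]},
    {"messages": [
        {"role": "user", "content": "Another prompt"},
        {"role": "assistant", "content": "Another completion"},
    ]},
]

def to_jsonl(records):
    """Serialize records as one JSON object per line (JSONL)."""
    return "\n".join(json.dumps(r) for r in records)

jsonl_payload = to_jsonl(examples)
print(jsonl_payload)
```

The resulting file would then be uploaded to the vendor's fine-tuning API; per the abstract, the concern is that such fine-tuning can remove RLHF safety protections.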