Feb. 14, 2024, 5:41 a.m. | Benjamin Plaut Hanlin Zhu Stuart Russell

cs.LG updates on arXiv.org arxiv.org

Most reinforcement learning algorithms with formal regret guarantees assume all mistakes are reversible and rely on essentially trying all possible options. This approach leads to poor outcomes when some mistakes are irreparable or even catastrophic. We propose a variant of the contextual bandit problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff each round represents the chance of avoiding catastrophe that round, and try to maximize the product of payoffs (the overall …

algorithms catastrophe continuous cs.ai cs.lg leads mistakes reinforcement reinforcement learning spaces

Founding AI Engineer, Agents

@ Occam AI | New York

AI Engineer Intern, Agents

@ Occam AI | US

AI Research Scientist

@ Vara | Berlin, Germany and Remote

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Alternance DATA/AI Engineer (H/F)

@ SQLI | Le Grand-Quevilly, France