Jan. 20, 2022, 2:10 a.m. | Hengrui Zhang, Youfang Lin, Sheng Han, Shuo Wang, Kai Lv

cs.LG updates on arXiv.org

Safe exploration can be regarded as a constrained Markov decision problem in
which the expected long-term cost is constrained. Previous off-policy
algorithms convert the constrained optimization problem into the corresponding
unconstrained dual problem via the Lagrangian relaxation technique. However,
the cost functions of these algorithms provide inaccurate estimates, which
destabilizes learning of the Lagrange multiplier. In this paper, we present a
novel off-policy reinforcement learning algorithm called Conservative
Distributional Maximum a Posteriori Policy Optimization (CDMPO). At first, …
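The Lagrangian relaxation step the abstract refers to can be illustrated with a minimal sketch. This is not the CDMPO implementation; all names, learning rates, and the toy cost trajectory below are illustrative assumptions. The constrained objective max_π E[return] subject to E[cost] ≤ d becomes the unconstrained dual max_π min_{λ≥0} E[return] − λ(E[cost] − d), with the multiplier λ updated by gradient ascent on the constraint violation:

```python
def dual_objective(expected_return, expected_cost, lam, cost_limit):
    """Lagrangian of the constrained MDP for a fixed policy:
    return minus the penalized constraint violation."""
    return expected_return - lam * (expected_cost - cost_limit)

def update_multiplier(lam, expected_cost, cost_limit, lr=0.05):
    """Gradient-ascend the multiplier on the violation
    (expected_cost - cost_limit), then project back onto lam >= 0."""
    return max(0.0, lam + lr * (expected_cost - cost_limit))

# Toy loop: the multiplier grows while the (pretend) policy violates
# the cost limit and shrinks once it satisfies it.
lam, cost_limit = 0.0, 1.0
for step in range(200):
    expected_cost = 2.0 - 0.01 * step  # stand-in for a slowly safer policy
    lam = update_multiplier(lam, expected_cost, cost_limit)
```

The abstract's point is that this dual update is only as good as the cost estimate: if `expected_cost` is inaccurate, `lam` oscillates, which is the instability CDMPO targets.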

