Web: http://arxiv.org/abs/2204.09315

Sept. 19, 2022, 1:12 a.m. | Hung Le, Thommen Karimpanal George, Majid Abdolshah, Dung Nguyen, Kien Do, Sunil Gupta, Svetha Venkatesh

cs.LG updates on arXiv.org

We introduce a constrained optimization method for policy gradient
reinforcement learning, which uses a virtual trust region to regulate each
policy update. In addition to using the proximity of a single old policy as
the standard trust region, we propose forming a second trust region through
another virtual policy that represents a wide range of past policies. We then
constrain the new policy to stay closer to the virtual policy, which is
beneficial if the old policy performs poorly. More importantly, …
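The two-trust-region idea can be illustrated with a small sketch for discrete actions. This is not the authors' algorithm: the truncated abstract does not say how the virtual policy is built or how the constraints are enforced. Here, as assumptions, the virtual policy is a weighted mixture of stored past policies, and the two hard trust-region constraints are replaced by KL penalties (in the style of PPO's KL-penalty variant) on the standard importance-sampled surrogate used by TRPO/PPO. The names `virtual_policy`, `dual_trust_region_objective`, `beta_old`, and `beta_virt` are hypothetical.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same action set.

    Terms with p[a] == 0 contribute nothing; q is assumed strictly positive
    wherever p is positive.
    """
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0.0)


def virtual_policy(past_policies: Sequence[Sequence[float]],
                   weights: Sequence[float]) -> List[float]:
    """Hypothetical virtual policy: a weighted mixture of stored past policies.

    A non-negative weighted mixture of distributions, normalised by the total
    weight, is itself a valid distribution.
    """
    total = sum(weights)
    n_actions = len(past_policies[0])
    return [
        sum(w * pol[a] for w, pol in zip(weights, past_policies)) / total
        for a in range(n_actions)
    ]


def dual_trust_region_objective(new_probs: Sequence[float],
                                old_probs: Sequence[float],
                                virt_probs: Sequence[float],
                                action: int, advantage: float,
                                beta_old: float, beta_virt: float) -> float:
    """Per-sample surrogate with two KL penalties (to be maximised)."""
    # Importance-sampled surrogate, as in TRPO/PPO: ratio * advantage.
    ratio = new_probs[action] / old_probs[action]
    surrogate = ratio * advantage
    # First trust region: penalise moving away from the single old policy.
    penalty_old = beta_old * kl_divergence(old_probs, new_probs)
    # Second trust region: penalise moving away from the virtual policy.
    penalty_virt = beta_virt * kl_divergence(virt_probs, new_probs)
    return surrogate - penalty_old - penalty_virt


# Usage: three stored past policies mixed with uniform (hypothetical) weights.
past = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]
virt = virtual_policy(past, weights=[1.0, 1.0, 1.0])
old = [0.8, 0.1, 0.1]

# With zero advantage only the penalties matter, so a large beta_virt favours
# a candidate that stays near the virtual policy over one that copies `old`.
near_old = dual_trust_region_objective(old, old, virt, action=1, advantage=0.0,
                                       beta_old=0.1, beta_virt=1.0)
near_virt = dual_trust_region_objective(virt, old, virt, action=1, advantage=0.0,
                                        beta_old=0.1, beta_virt=1.0)
print(near_old, near_virt)
```

Choosing `beta_virt > beta_old` is one way to express "stay closer to the virtual policy"; in practice the objective would be averaged over a batch of samples and maximised with respect to the policy parameters.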


