PPO self-play, probability sampling instead of highest probability
I read a paper in which they use PPO to learn a two-player game through self-play.
They update the network using only one agent's experiences, and the same network drives both agents.
My question is: why do they sample the opponent's actions from the policy's probability distribution instead of taking the action with the highest probability, given that we want the opponent to play the best action for a given state?
And wouldn't it always be better to train …
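For context, here is a minimal PyTorch sketch of the two selection rules the question contrasts. The network architecture, the state size of 8, and the 4 discrete actions are placeholder assumptions for illustration, not details from the paper.

```python
import torch
from torch.distributions import Categorical

# Shared policy network used by both agents in self-play
# (placeholder shapes: 8-dimensional state, 4 discrete actions).
policy = torch.nn.Sequential(
    torch.nn.Linear(8, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 4),  # outputs action logits
)

state = torch.randn(8)  # placeholder state
logits = policy(state)

# Probability sampling (what the paper does): the opponent's move is
# drawn from the full policy distribution, so its behavior varies
# from game to game even in identical states.
sampled_action = Categorical(logits=logits).sample()

# Greedy selection (what the question proposes): always the single
# most probable action, making the opponent fully deterministic.
greedy_action = torch.argmax(logits)
```

One common argument for the sampled version: a greedy opponent is deterministic, so the learning agent sees a much narrower set of trajectories and can overfit to exploiting one fixed line of play rather than learning generally strong behavior.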
Data Analyst, Patagonia Action Works
@ Patagonia | Remote
Data & Insights Strategy & Innovation General Manager
@ Chevron Services Company, a division of Chevron U.S.A. Inc. | Houston, TX
Faculty members in research areas such as Bayesian and Spatial Statistics; Data Privacy and Security; AI/ML; NLP; Image and Video Data Analysis
@ Ahmedabad University | Ahmedabad, India
Director, Applied Mathematics & Computational Research Division
@ Lawrence Berkeley National Lab | Berkeley, CA
Business Data Analyst
@ MainStreet Family Care | Birmingham, AL
Assistant/Associate Professor of the Practice in Business Analytics
@ Georgetown University McDonough School of Business | Washington, DC