Web: https://www.reddit.com/r/reinforcementlearning/comments/vgp0e8/double_qlearning_in_sb3s_sac_implementation/

June 20, 2022, 4:08 p.m. | /u/IAmMiddy

Reinforcement Learning reddit.com


According to [this change](https://stable-baselines3.readthedocs.io/en/master/misc/changelog.html#id59), SAC and TD3 in the SB3 implementation can take an arbitrary number of critics. Indeed, if we check the source code for e.g. [SAC's train function](https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/sac/sac.py#L189), we find:

```python
next_q_values = th.cat(self.critic_target(replay_data.next_observations, next_actions), dim=1)
next_q_values, _ = th.min(next_q_values, dim=1, keepdim=True)
# ...
q_values_pi = th.cat(self.critic(replay_data.observations, actions_pi), dim=1)
min_qf_pi, _ = th.min(q_values_pi, dim=1, keepdim=True)
```

There, for each sample in the batch, the minimum over the n=2 (by default) critic estimates is taken along the critic dimension (`dim=1`), and that per-sample minimum is used in both the critic target and the actor loss.
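To see what those two lines compute, here is a minimal standalone sketch with hand-made tensors standing in for the outputs of two critics (the values and shapes are assumptions for illustration, not taken from SB3):

```python
import torch as th

# Hypothetical Q-value outputs of two critics for a batch of 3 samples,
# each of shape (batch, 1), mimicking the tuple returned by SB3's critic.
q1 = th.tensor([[1.0], [4.0], [2.0]])
q2 = th.tensor([[3.0], [0.5], [2.5]])

# Concatenate along dim=1 -> shape (batch, n_critics) = (3, 2)
q_all = th.cat((q1, q2), dim=1)

# Minimum across the critic dimension, per sample (not across the batch)
min_q, _ = th.min(q_all, dim=1, keepdim=True)
# -> elementwise minimum per sample: [[1.0], [0.5], [2.0]]
```

So each batch element keeps its own pessimistic (minimum) Q-estimate; nothing is reduced over the batch dimension.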

I …
