Offline Reinforcement Learning with Mixture of Deterministic Policies
Abstract: Offline reinforcement learning (RL) has recently attracted considerable attention as an approach for utilizing past experiences to learn a policy. Recent studies have reported the challenges of offline RL, such as estimating the values of actions that are outside the data distribution. To mitigate these issues, we propose an algorithm that leverages a mixture of deterministic policies. When the data distribution is multimodal, fitting a policy modeled with a unimodal distribution, such as a Gaussian distribution, may lead to interpolation between separate modes, so that the critic must estimate the values of actions that are outside the data distribution. In our framework, the state-action space is divided by learning discrete latent variables, and a sub-policy is trained for each resulting region. The proposed algorithm is derived by considering the variational lower bound of the offline RL objective function. We show empirically that the proposed mixture policy can reduce the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that using a mixture of deterministic policies in offline RL improves performance on the D4RL benchmark datasets.
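The mode-interpolation issue described in the abstract can be illustrated with a toy, single-state example. The sketch below is not the proposed algorithm: it swaps in a plain least-squares fit for the unimodal policy and a hard-assignment fit (one-dimensional k-means with two centers) as a stand-in for a mixture of deterministic policies with a discrete latent variable. All function names, the synthetic data, and the choice of two modes are assumptions made for illustration.

```python
import numpy as np


def fit_single_mean(actions):
    """Least-squares fit of one deterministic action, i.e. the sample mean."""
    return float(np.mean(actions))


def fit_two_deterministic(actions, iters=20):
    """Hard-assignment fit of two deterministic actions (1-D k-means, k=2).

    Illustrative stand-in only; this is not the variational objective
    used in the paper.
    """
    centers = np.array([actions.min(), actions.max()], dtype=float)
    for _ in range(iters):
        # Discrete latent z: index of the nearest center for each action.
        z = np.argmin(np.abs(actions[:, None] - centers[None, :]), axis=1)
        for k in range(2):
            if np.any(z == k):
                centers[k] = actions[z == k].mean()
    return centers


def distance_to_data(action, actions):
    """Distance from a proposed action to the closest action in the dataset."""
    return float(np.min(np.abs(actions - action)))


# Synthetic bimodal behavior data for a single state (assumed for illustration).
rng = np.random.default_rng(0)
actions = np.concatenate([
    rng.normal(-1.0, 0.05, 500),
    rng.normal(1.0, 0.05, 500),
])

single = fit_single_mean(actions)         # lands between the two modes
mixture = fit_two_deterministic(actions)  # one deterministic action per mode

print("single policy action:", single,
      "distance to data:", distance_to_data(single, actions))
for k, a in enumerate(mixture):
    print(f"sub-policy {k} action:", float(a),
          "distance to data:", distance_to_data(a, actions))
```

In this toy setting the single fitted action falls in the gap between the two modes, far from any action in the dataset, whereas each of the two sub-policy actions stays close to the data. A critic queried at the single fitted action would therefore be evaluated outside the data distribution.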
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
Sept. 7
- Added a description of the gradient approximation for the discrete latent variable in Section 6
- Revised Equation (6) for clarity
- Added the link to our code
August 23
- Updated the sentence in the introduction to address the concern raised by Reviewer DypM
August 8
- Added LP-AWAC to Table 1 to summarize the implementation of baseline methods
- Added the comparison with LP-AWAC on the toy problem in Section 7.1
- Replaced the results of LAPO- with LP-AWAC in Section 7.2
- Added a description of the implementation of LP-AWAC in Appendix F
August 3
- Modified the introduction and added Section 8 “Limitation of the proposed method”
- Revised the explanation of the prior in Equation 9
- Added a paragraph “Approximation gap” on page 5 to discuss the gap between the objective functions of AWAC and DMPO
- Revised the description of mutual-information-based regularization on page 5
- Revised Section 6 to describe how to model the discrete latent variable
- Added the comparison with Diffusion QL
- Revised a sentence in Section 7.2
- Added Section 8 “Limitation of the proposed method”
- Added how to select the hyperparameter of infoDMPO in Appendix F
- Added the implementation details of mixAWAC in Appendix F
July 12
- Added Gulcehre et al. (2020) as a reference for the one-step RL
July 11
- Changed "equation x" to "Equation x"
- Corrected unintentional lowercase letters in references
- On page 17, "3.0GHz" -> "3.0 GHz"
Assigned Action Editor: ~Bo_Dai1
Submission Number: 1321