Stabilizing Policy Gradients for Stochastic Differential Equations by Enforcing Consistency with the Perturbation Process

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: stochastic differential equation; reinforcement learning; diffusion models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Stochastic differential equations (SDEs) parameterized by deep neural networks have received increasing attention from the machine learning community due to their high expressiveness and solid theoretical foundations, with a wide range of applications in generative modeling. However, maximizing the likelihood of training data, the standard objective of generative models, does not always meet the requirements of many real-world problems. Fortunately, introducing reinforcement learning (e.g., policy gradients) to maximize a reward, with the SDE serving as the policy, may bridge this gap. Nevertheless, when applying policy gradients to SDEs, the gradient is estimated on a finite set of trajectories and can therefore be ill-defined, and the policy behavior in data-scarce regions may be uncontrolled. These challenges compromise the stability of policy gradients and negatively impact sample complexity. To address these issues, we propose constraining the SDE to be consistent with its associated perturbation process. Since the perturbation process covers the entire space and is easy to sample, we can mitigate the aforementioned problems. Our framework offers a general approach for training SDEs with policy gradients, allowing a versatile selection of policy gradient estimators to train SDEs effectively and efficiently. We evaluate our algorithm on the task of structure-based drug design, optimizing the binding affinity of generated ligand molecules. Our method achieves the best Vina score (-9.07) on the CrossDocked2020 dataset.
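As a rough illustration of the idea in the abstract (this sketch is not taken from the submission; the notation p_theta, q_t, s_theta, r, and the weight lambda are all assumptions), the proposed constraint can be read as augmenting the reward objective with a score-matching-style consistency term evaluated on the perturbation process's marginals:

```latex
% A minimal sketch of a plausible combined objective (assumed notation, not the paper's):
%   p_\theta  -- distribution of samples generated by the SDE policy
%   q_t       -- marginal of the associated perturbation (noising) process at time t
%   s_\theta  -- the learned drift/score network;  r -- the reward;  \lambda -- trade-off weight
\max_{\theta} \;
\mathbb{E}_{x \sim p_\theta}\!\big[ r(x) \big]
\;-\; \lambda \,
\mathbb{E}_{t,\; x_t \sim q_t}
\Big[ \big\| s_\theta(x_t, t) - \nabla_{x_t} \log q_t(x_t) \big\|_2^2 \Big]
```

Under this reading, the regularizer is evaluated on samples from q_t, which covers the entire space and is cheap to draw from, so it constrains the policy's behavior even in regions that the finitely many reward trajectories never visit.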
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1430