Keywords: Flexible docking, generative modeling, long-tailed data
Abstract: As generative modeling is increasingly applied to scientific problems, overfitting to noisy data is an emerging issue, especially as synthetic data becomes more widely used for training. In this work, we provide an explicit observation of noise-related overfitting in modern generative models for scientific applications, focusing on flexible docking. In this task, the generative model is trained to model both the protein conformational shift from the apo to the holo state and the ligand binding pose, where the apo structure is predicted by a large-scale deep learning model. The resulting apo-holo pairs exhibit a long-tailed distribution of structural shift, which the model must learn effectively. Motivated by this observation, we propose robust training techniques for generative models, including an asynchronized time schedule and inverse shift sigmoid reweighting, supported by both theoretical analysis and empirical results. Through effective loss reweighting, model regularization, improved confidence model training, and fine-grained control of robustness over protein degrees of freedom, our \textsc{Robustdock} achieves state-of-the-art performance on the PDBBind benchmark and improves the fraction of predictions with ligand RMSD $<$ 2 Å by 13.6 percentage points on the large conformational shift subset.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 107