General-purpose Pre-trained Model Towards Cross-domain Molecule Learning

17 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: molecular representation learning, self-supervised pre-training, multimodal learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: A pre-trained foundation model for general-purpose molecular modeling.
Abstract: Self-supervised pre-training on biomolecules has achieved remarkable success in various biochemical applications, such as drug discovery and protein design. However, most existing approaches build the learning model around the characteristics of either small molecules or proteins alone, without exploring their binding interactions -- an essential cross-domain relationship that drives numerous biological processes. In this paper, inspired by the success of multimodal learning, we fill this gap by proposing a general-purpose foundation model named **BIT** (short for **B**iomolecular **I**nteraction **T**ransformer), which encodes a range of biochemical entities, including small molecules, proteins, and protein-ligand complexes, as well as multiple data formats, covering both 2D and 3D structures, within a shared Transformer backbone via unified self-supervised atom-level *denoising* tasks. We introduce *Mixture-of-Domain-Experts* (MoDE) to handle biomolecules from diverse chemical domains and incorporate separate structural channels to capture positional dependencies in molecular structures. MoDE enables BIT to perform both deep fusion and domain-specific encoding, and to learn cross-domain relationships from protein-ligand complexes with 3D cocrystal structures. Experimental results demonstrate that BIT achieves exceptional performance on both protein-ligand binding and molecular learning downstream tasks, including binding affinity prediction, virtual screening, and molecular property prediction.
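
The following is a minimal, hypothetical PyTorch sketch of the MoDE idea as the abstract describes it: self-attention is shared across domains (supporting deep fusion), while each atom token is routed to a feed-forward expert selected by its chemical domain (domain-specific encoding). The class name, domain labels, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a Mixture-of-Domain-Experts (MoDE) Transformer layer.
# Shared self-attention lets ligand and protein tokens attend to each other,
# while per-domain feed-forward experts keep domain-specific encoding.
# Domain IDs, names, and sizes below are assumptions for exposition only.

class MoDETransformerLayer(nn.Module):
    DOMAINS = ("small_molecule", "protein", "complex")

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        # Attention parameters are shared across all chemical domains.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # One feed-forward expert per chemical domain.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )
            for name in self.DOMAINS
        })

    def forward(self, x: torch.Tensor, domain_ids: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) atom tokens.
        # domain_ids: (batch, seq) integers indexing into DOMAINS, e.g.
        # ligand atoms vs. pocket residues within one protein-ligand complex.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for idx, name in enumerate(self.DOMAINS):
            mask = domain_ids == idx  # tokens belonging to this domain
            if mask.any():
                out[mask] = self.experts[name](h[mask])
        return x + out
```

Under this reading, a protein-ligand complex can be processed in one forward pass: the shared attention fuses information across the binding interface, while routing ligand atoms and pocket atoms to different experts preserves domain-specific inductive biases.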
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 942