Learning High-Order Substructure Association from Molecules with Transformers

Hoang Thanh Lam; Raul Fernandez-Diaz; Vanessa López

27 Sept 2024 (modified: 21 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: Molecule, representation learning, drug discovery, ADMET, drug properties prediction

Abstract: Molecular graphs are commonly represented as SMILES (Simplified Molecular Input Line Entry System) strings, which turn a molecular graph into a token sequence. Transformers, powerful neural networks originally developed for natural language processing, have been adapted to learn molecular representations from SMILES by predicting masked tokens, but they have yet to achieve competitive performance on ADMET benchmark datasets, which are crucial for assessing drug properties such as absorption, distribution, metabolism, excretion, and toxicity. This paper identifies a key problem: conventional random token masking in SMILES overlooks essential molecular substructures, leading transformers to focus on superficial correlations between individual tokens rather than on the relationships among tokens within substructures. We propose an approach that improves transformers' ability to recognize molecular substructures by combining a substructure-aware masking strategy with a new learning objective. The method embeds substructure information directly into the masking and prediction process, so that the model predicts specific subgraphs instead of random tokens. In our experiments, transformers trained with both changes outperform those trained with conventional random masking, yielding better predictions of drug-related properties on ADMET benchmarks. This work contributes to the ongoing development of transformer architectures for molecular representation learning.
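
The contrast between random token masking and substructure-aware masking can be illustrated with a small sketch. The abstract does not say how substructures are extracted or what the new learning objective looks like, so everything below is an assumption made for illustration: parenthesised SMILES branches stand in for "substructures", and the names `tokenize`, `branch_spans`, `random_mask`, and `substructure_mask` are hypothetical, not the authors' code.

```python
import random
import re

MASK = "[MASK]"

# A regex of the kind commonly used to split SMILES into tokens
# (bracket atoms, two-letter halogens, organic-subset atoms, ring digits, bonds, branches).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-\+\(\)/\\\.@:~\*\$])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting strings the regex cannot cover."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens


def branch_spans(tokens):
    """Return inclusive (start, end) token indices of each parenthesised branch.

    ASSUMPTION: branches are used here only as a stand-in for chemically
    meaningful substructures; the paper may define them differently.
    """
    stack, spans = [], []
    for i, tok in enumerate(tokens):
        if tok == "(":
            stack.append(i)
        elif tok == ")" and stack:
            spans.append((stack.pop(), i))
    return sorted(spans)


def random_mask(tokens, rate, rng):
    """Baseline: mask each token independently with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def substructure_mask(tokens, spans, rng):
    """Mask every token of one randomly chosen span, so the model must
    reconstruct a whole contiguous fragment rather than isolated tokens."""
    masked = list(tokens)
    if not spans:
        return masked, None
    start, end = rng.choice(spans)
    for i in range(start, end + 1):
        masked[i] = MASK
    return masked, (start, end)


rng = random.Random(0)
tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
spans = branch_spans(tokens)
baseline = random_mask(tokens, 0.15, rng)
masked, chosen = substructure_mask(tokens, spans, rng)
print("".join(baseline))
print("".join(masked), chosen)
```

In the baseline, masked positions are scattered and each can often be guessed from its immediate neighbours; in the span variant, a whole fragment such as `(=O)` is hidden at once, which is the intuition the abstract gives for why substructure-aware masking might force the model to learn relationships within substructures.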

Supplementary Material: zip

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 10246