FARM: Functional Group-Aware Representations for Small Molecules

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Molecular representation learning, Masked language modeling, Contrastive learning
TL;DR: This paper presents FARM, a model that enhances small molecule representation by integrating functional group-aware tokenization with masked language modeling and graph neural networks, achieving state-of-the-art performance on MoleculeNet.
Abstract: We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group-aware tokenization, which directly incorporates functional group information into SMILES, enriching it with detailed chemical context. For example, instead of using “O” to represent all oxygen atoms, we use specific tokens like “O_ketone” and “O_hydroxyl” to differentiate oxygen atoms belonging to distinct functional groups. This tokenization expands the chemical lexicon, thereby more effectively bridging SMILES and natural language and ultimately enhancing the model’s ability to predict molecular properties. FARM also represents molecules from two perspectives: masked language modeling captures atom-level features, while graph neural networks encode whole-molecule topology. FARM leverages contrastive learning to align these two views into a unified molecular embedding. We rigorously evaluate FARM on the MoleculeNet benchmark, where it achieves state-of-the-art performance on 11 out of 13 tasks. These results highlight FARM’s potential to improve molecular representation learning and demonstrate its strong transfer learning capabilities, paving the way for promising applications in drug discovery and pharmaceutical research.
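To make the tokenization idea concrete, below is a minimal Python sketch (not the authors’ implementation) of how functional group-aware atom tokens such as “O_hydroxyl” could be produced with RDKit. The SMARTS patterns, group names, and first-match-wins precedence are illustrative assumptions; FARM’s actual vocabulary and group-detection procedure are described in the paper itself.

```python
# Minimal sketch of functional group-aware tokenization (assumes RDKit is installed).
# Atoms covered by a matched SMARTS pattern receive a group-specific token,
# e.g. "O_carboxyl", instead of the bare element symbol.
from rdkit import Chem

# Hypothetical functional-group SMARTS map (illustrative subset, not FARM's vocabulary).
# Order matters: the first matching group claims an atom ("first match wins").
FUNCTIONAL_GROUPS = {
    "carboxyl": "C(=O)[OX2H1]",          # -COOH
    "ketone": "[#6][CX3](=O)[#6]",       # C-C(=O)-C
    "hydroxyl": "[OX2H]",                # -OH
    "amine": "[NX3;H2,H1;!$(NC=O)]",     # primary/secondary amine, not amide
}

def tokenize_with_functional_groups(smiles: str) -> list[str]:
    """Return one token per heavy atom, annotated with its functional group."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")

    # Map atom index -> functional-group name; setdefault enforces precedence.
    atom_group = {}
    for name, smarts in FUNCTIONAL_GROUPS.items():
        pattern = Chem.MolFromSmarts(smarts)
        for match in mol.GetSubstructMatches(pattern):
            for idx in match:
                atom_group.setdefault(idx, name)

    tokens = []
    for atom in mol.GetAtoms():
        group = atom_group.get(atom.GetIdx())
        symbol = atom.GetSymbol()
        tokens.append(f"{symbol}_{group}" if group else symbol)
    return tokens

# Acetic acid: the hydroxyl oxygen is claimed by the higher-priority carboxyl group.
print(tokenize_with_functional_groups("CC(=O)O"))
# -> ['C', 'C_carboxyl', 'O_carboxyl', 'O_carboxyl']
```

In this sketch the expanded tokens (e.g. “O_carboxyl” vs. plain “O”) form the enlarged chemical lexicon over which masked language modeling is performed; the paper pairs these atom-level representations with a graph neural network view of the full molecular topology via contrastive alignment.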
Supplementary Material: zip
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9334