FARM: Enhancing Molecular Representations with Functional Group Awareness

TMLR Paper7902 Authors

12 Mar 2026 (modified: 27 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES (a linear string representation of molecular structures), natural language, and molecular graphs. The key innovation of FARM lies in its functional group (FG) annotation at the atomic level, which enables both FG-enhanced SMILES and FG graphs. SMILES are enriched with FG information to specify which functional group each atom belongs to, while the FG graph captures the molecular backbone by showing how the functional groups are connected. This tokenization not only injects chemical knowledge into SMILES but also expands the chemical lexicon, effectively bridging the gap between SMILES and natural language in terms of vocabulary size, making the sequences more suitable for Transformer-based models. FARM learns molecular representations from two complementary perspectives to encode both functional and structural information. Masked language modeling on FG-enhanced SMILES captures atom-level features enriched with FG context, while graph neural networks encode higher-level molecular topology by representing how functional groups are connected. Contrastive learning then aligns these two views into a unified embedding, ensuring that atom-level details and FG-level structure are jointly represented. This dual-level modeling is central to FARM’s ability to predict molecular properties accurately. We rigorously evaluate FARM on the MoleculeNet dataset, achieving state-of-the-art performance on 11 out of 13 tasks, and further validate its generalization on the photostability dataset for quantum mechanics properties. These results highlight FARM’s potential to improve molecular representation learning, demonstrate strong transfer learning capabilities across drug discovery and material design domains, and enable broad applications in pharmaceutical research and functional materials development. 
The code is available at: https://anonymous.4open.science/r/farm-E3CB
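To make the two views described in the abstract concrete, here is a minimal sketch of atom-level FG annotation and the resulting FG graph. It is an illustration under stated assumptions, not the authors' implementation: the `atom_fg` token format, the function names `fg_enhanced_tokens` and `fg_graph_edges`, and the hand-assigned functional group labels for acetic acid are all hypothetical, and bond and branch symbols are left out for brevity.

```python
from typing import List, Set, Tuple


def fg_enhanced_tokens(atoms: List[str], fg_labels: List[str]) -> List[str]:
    """Attach a functional group label to each atom symbol.

    The "atom_fg" token format is an assumption made for this sketch;
    the paper's actual tokenization may differ.
    """
    if len(atoms) != len(fg_labels):
        raise ValueError("need exactly one FG label per atom")
    return [f"{atom}_{fg}" for atom, fg in zip(atoms, fg_labels)]


def fg_graph_edges(
    bonds: List[Tuple[int, int]], fg_ids: List[int]
) -> Set[Tuple[int, int]]:
    """Collapse atom-level bonds into edges between functional groups.

    fg_ids[i] is the index of the FG instance that atom i belongs to.
    Bonds inside a single FG are dropped; bonds across FGs become edges.
    """
    edges: Set[Tuple[int, int]] = set()
    for i, j in bonds:
        a, b = fg_ids[i], fg_ids[j]
        if a != b:
            edges.add((min(a, b), max(a, b)))
    return edges


# Acetic acid, SMILES "CC(=O)O", atoms in SMILES order: C0, C1, O2, O3.
# The FG assignment below is written by hand (hypothetical), not detected.
atoms = ["C", "C", "O", "O"]
fg_labels = ["alkyl", "carboxylic_acid", "carboxylic_acid", "carboxylic_acid"]
fg_ids = [0, 1, 1, 1]              # FG 0 = alkyl, FG 1 = carboxylic acid
bonds = [(0, 1), (1, 2), (1, 3)]   # atom-index pairs

tokens = fg_enhanced_tokens(atoms, fg_labels)
edges = fg_graph_edges(bonds, fg_ids)

plain_vocab = set(atoms)    # atom symbols only
fg_vocab = set(tokens)      # atom symbols split by FG context
```

In this toy case the plain atom vocabulary has two symbols while the FG-enhanced vocabulary has three, which is the vocabulary expansion the abstract refers to, and the FG graph reduces to a single edge between the alkyl group and the carboxylic acid group.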
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=2QGXeWfqbX
Changes Since Last Submission:
- Added a brief explanation of SMILES in the abstract for better accessibility to readers outside the chemistry domain.
- Revised the abstract to remove overly specific details and improve clarity and generality.
- Updated the anonymous GitHub repository link to an active version for code access.
- Corrected and clarified the contrastive learning formula to improve technical correctness.
- Added additional descriptions of figures to better illustrate (1) masked language modeling on FG-enhanced SMILES and (2) contrastive learning between SMILES representations and molecular graphs, providing clearer insight into how these components help learn meaningful molecular representations.
- Fixed minor typos and improved wording throughout the manuscript.
Assigned Action Editor: ~Christopher_Morris1
Submission Number: 7902