Abstract: We introduce Functional Group-Aware Representations for Small Molecules (FARM), a foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. FARM's key innovation is functional group (FG) annotation at the atomic level, which yields two representations: FG-enhanced SMILES, in which each atom token specifies the functional group the atom belongs to, and FG graphs, which capture the molecular backbone by showing how the functional groups are connected. As an example of FG-enhanced SMILES, instead of using "O" for every oxygen atom, we assign specific tokens such as "O_ketone" and "O_hydroxyl" to distinguish oxygen atoms that belong to different FGs. This tokenization injects chemical knowledge into SMILES and also expands the chemical lexicon, narrowing the gap in vocabulary size between SMILES and natural language and making the sequences better suited to Transformer-based models. As an example of an FG graph, consider methanol (CH3OH), which contains a hydroxyl group attached to a methyl group. In the FG graph, each FG is a node: one for the hydroxyl (OH) and one for the methyl (CH3). FARM learns molecules from these two complementary perspectives to encode both functional and structural information. Masked language modeling on FG-enhanced SMILES captures atom-level features enriched with FG context, while graph neural networks encode higher-level molecular topology by representing how FGs are connected. Contrastive learning then aligns the two views into a unified embedding, so that atom-level details and FG-level structure are jointly represented. This dual-level modeling is central to FARM’s ability to predict molecular properties accurately.
We evaluate FARM on the MoleculeNet benchmark, achieving state-of-the-art performance on 11 out of 13 tasks, and further validate its generalization on a photostability dataset of quantum mechanical properties. These results highlight FARM’s potential to improve molecular representation learning, demonstrate its transfer learning capabilities across the drug discovery and material design domains, and pave the way for applications in pharmaceutical research and functional materials development. The code is available at: https://anonymous.4open.science/r/farm_molecular_representation-1E2F
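The two representations described in the abstract can be illustrated with a minimal sketch. This is not FARM's implementation: the `Atom` class, the helper names `fg_enhanced_tokens` and `fg_graph`, and the hand-assigned FG labels (including "methylene" and the choice of hydroxyacetone as a second molecule) are assumptions made for illustration. Only the token style ("O_ketone", "O_hydroxyl") and the methanol example come from the abstract. In practice the FG labels would come from an automatic FG detection step rather than manual annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str  # atomic symbol, e.g. "C" or "O"
    fg_id: int    # index of the functional group this atom belongs to
    fg_name: str  # hand-assigned FG label (hypothetical annotation)


def fg_enhanced_tokens(atoms):
    """Replace each plain atom symbol with an FG-aware token."""
    return [f"{a.element}_{a.fg_name}" for a in atoms]


def fg_graph(atoms, bonds):
    """Collapse atoms into FG nodes; keep only bonds that cross FG boundaries."""
    nodes = {a.fg_id: a.fg_name for a in atoms}
    edges = set()
    for i, j in bonds:
        u, v = atoms[i].fg_id, atoms[j].fg_id
        if u != v:  # bonds inside one FG do not create an edge
            edges.add((min(u, v), max(u, v)))
    return nodes, sorted(edges)


# Methanol (CH3OH): a methyl group bonded to a hydroxyl group.
methanol_atoms = [Atom("C", 0, "methyl"), Atom("O", 1, "hydroxyl")]
methanol_bonds = [(0, 1)]

# Hydroxyacetone, SMILES CC(=O)CO (illustrative choice, labels assigned by hand).
hydroxyacetone_atoms = [
    Atom("C", 0, "methyl"),
    Atom("C", 1, "ketone"),
    Atom("O", 1, "ketone"),
    Atom("C", 2, "methylene"),
    Atom("O", 3, "hydroxyl"),
]
hydroxyacetone_bonds = [(0, 1), (1, 2), (1, 3), (3, 4)]

if __name__ == "__main__":
    print(fg_enhanced_tokens(hydroxyacetone_atoms))
    print(fg_graph(methanol_atoms, methanol_bonds))
    print(fg_graph(hydroxyacetone_atoms, hydroxyacetone_bonds))
```

In the hydroxyacetone example the two oxygen atoms receive different tokens, which is the distinction the FG-enhanced vocabulary is meant to expose, while the FG graph reduces five heavy atoms to four connected nodes.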
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Alessandro_Sperduti1
Submission Number: 6057