Enhancing Low-Cost Molecular Property Prediction with Contrastive Learning on SMILES Representations
Abstract: This paper explores self-supervised contrastive learning on Simplified Molecular Input Line Entry System (SMILES) representations. Contrastive learning is used to pre-train the model before its downstream application to molecular property prediction. Central to our approach is a contrastive paradigm in which pairs of distinct SMILES strings corresponding to the same molecule are treated as positive samples, while negative samples comprise pairs of SMILES strings representing distinct molecules. This methodology enriches the model's understanding of molecular structure by emphasizing that molecular properties are invariant under permutations of the SMILES representation, thereby enhancing the robustness and predictive accuracy of the model. We conducted a series of experiments on publicly available and proprietary datasets to evaluate the efficacy of our approach. Empirical results underscore the merits of the contrastive learning model over traditional methods, demonstrating improvements on molecular property prediction tasks. This study not only validates the potential of contrastive learning in chemoinformatics, but also sets a promising direction for future research in leveraging SMILES permutations for molecular analysis.
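The abstract does not specify the exact loss, but the positive/negative pairing it describes matches the standard NT-Xent (InfoNCE) formulation. The sketch below is illustrative only: the toy embeddings and all names are hypothetical stand-ins for the output of a real SMILES encoder, and it assumes nothing about the paper's actual architecture. Two SMILES permutations of the same molecule (e.g. "CCO" and "OCC" for ethanol) form a positive pair; a SMILES of a different molecule (e.g. "c1ccccc1", benzene) serves as a negative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(anchor, positive, negatives, tau=0.5):
    # NT-Xent contrastive loss: pull the positive pair together in
    # embedding space while pushing the negatives away from the anchor.
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical 2-D encoder outputs, keyed by SMILES string; a real encoder
# over SMILES tokens would produce higher-dimensional embeddings.
emb = {
    "CCO":      [0.90, 0.10],   # ethanol, canonical SMILES
    "OCC":      [0.85, 0.15],   # ethanol, permuted SMILES (positive pair)
    "c1ccccc1": [-0.20, 0.95],  # benzene (negative sample)
}

loss = nt_xent(emb["CCO"], emb["OCC"], [emb["c1ccccc1"]])
```

Because the positive pair is nearly parallel and the negative is nearly orthogonal to the anchor, the loss is small; pre-training would minimize this quantity over many sampled SMILES permutations.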