Leveraging Hybrid Representations for Robust Molecular Property Prediction in Low-Data Regimes

Published: 10 Mar 2026, Last Modified: 11 May 2026 | IEEE ICDM | Readers: Everyone | License: arXiv.org perpetual, non-exclusive license
Abstract: Molecular machine learning faces a persistent tradeoff between interpretability and predictive accuracy. Descriptors and fingerprints provide chemically meaningful features but limited predictive power, while learned representations from graph neural networks (GNNs) or SMILES-based models achieve high accuracy at the expense of transparency. In this study, we restrict experiments to descriptors and fixed fingerprints, leaving learned embeddings as a future extension. We conduct a systematic evaluation of hybrid molecular representations that combine descriptors with fingerprints for property prediction across classification and regression tasks. On the BBBP (blood-brain barrier permeability), ESOL (aqueous solubility), and FreeSolv (hydration free energy) datasets, hybrids yield up to 7% higher ROC-AUC than descriptors alone and reduce RMSE by up to 48% relative to fingerprints alone, with the largest gains in low-data regimes (10–25% of the training data). Ablation studies show that descriptors and fingerprints provide complementary signals, and feature analyses confirm that interpretability is preserved. By quantifying robustness under both full-data and data-scarce settings, this study demonstrates that hybrid feature fusion is an effective and reliable strategy for molecular property prediction. The complete framework is integrated into the Hands-On Data Science for Chemists platform to support reproducibility and adoption.
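The following is a minimal sketch of what descriptor-plus-fingerprint fusion in a low-data setting could look like. It is not the authors' implementation. The abstract does not say how the two feature blocks are combined, so simple concatenation, z-scoring of descriptors with training-set statistics, and random subsampling of the training set are all assumptions here. The helper names (`fit_scaler`, `fuse`, `subsample`) are hypothetical and the feature values are made-up toy numbers, not data from BBBP, ESOL, or FreeSolv.

```python
import random
from statistics import mean, pstdev


def fit_scaler(desc_rows):
    """Column-wise mean and population std, computed on training rows only."""
    cols = list(zip(*desc_rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # fall back to 1.0 for constant columns
    return mus, sds


def fuse(desc_row, fp_bits, mus, sds):
    """Hybrid vector: z-scored descriptors followed by raw fingerprint bits.

    Concatenation is an assumed fusion scheme, chosen for illustration.
    """
    scaled = [(x - m) / s for x, m, s in zip(desc_row, mus, sds)]
    return scaled + [float(b) for b in fp_bits]


def subsample(indices, fraction, seed=0):
    """Keep a random fraction of the training indices (low-data regime)."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(indices)))
    return sorted(rng.sample(indices, k))


# Toy inputs (made up): 3 descriptors and 8 fingerprint bits per molecule.
descriptors = [
    [180.2, 1.3, 2.0],
    [46.1, -0.3, 1.0],
    [78.1, 2.1, 0.0],
    [151.2, 0.5, 2.0],
]
fingerprints = [
    [1, 0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 1],
]

# Simulate a data-scarce setting by keeping half of the training molecules.
train_idx = subsample(list(range(len(descriptors))), fraction=0.5)

# Fit the scaler on the retained training rows only, then fuse every molecule.
mus, sds = fit_scaler([descriptors[i] for i in train_idx])
hybrid = [fuse(d, f, mus, sds) for d, f in zip(descriptors, fingerprints)]
```

Each row of `hybrid` can then be passed to any tabular model. Because the descriptor columns and fingerprint bits keep their positions in the fused vector, per-feature importance scores remain traceable to a named descriptor or a specific bit.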