Spectral Analysis of Molecular Kernels: When Richer Features Do Not Guarantee Better Generalization

Asma Jamali; Tin Sum Cheng; Rodrigo Vargas-Hernandez

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: molecular kernels, unsupervised learning, representation learning, pretrained molecular embedding models, self-supervised learning

TL;DR: Spectral analysis of various molecular kernels shows that richer spectra do not, in general, correlate with better performance.

Abstract: Understanding the spectral properties of kernels offers a principled perspective on generalization and representation quality. While deep models achieve state-of-the-art accuracy in molecular property prediction, kernel methods remain widely used for their robustness in low-data regimes and transparent theoretical grounding. Despite extensive studies of kernel spectra in machine learning, systematic spectral analyses of molecular kernels are scarce. In this work, we provide the first comprehensive spectral analysis of kernel ridge regression on the QM9 dataset, covering molecular fingerprints, pretrained transformer-based embeddings, and global and local 3D representations across seven molecular properties. Surprisingly, richer spectral features, as measured by four different spectral metrics, do not consistently improve accuracy. Pearson correlation tests further reveal that for transformer-based and local 3D representations, spectral richness can even correlate negatively with performance. We also implement truncated kernels to probe the relationship between spectrum and predictive performance: for many kernels, retaining only the top 2% of eigenvalues recovers nearly all performance, indicating that the leading eigenvalues capture the most informative features. Our results challenge the common heuristic that “richer spectra yield better generalization” and highlight nuanced relationships between representation, kernel features, and predictive performance. Beyond molecular property prediction, these findings inform how kernel and self-supervised learning methods are evaluated in data-limited scientific and real-world tasks.
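
The truncation experiment described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the RBF kernel, the `gamma` and `lam` values, and the function names `rbf_kernel`, `truncated_krr_fit`, and `effective_rank` are all assumptions made for illustration. The effective rank (exponential of the spectral entropy) is shown as one common way to quantify spectral richness; the abstract does not name its four metrics, so this may not be one of them.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian (RBF) kernel; an illustrative choice, not taken from the paper.
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def truncated_krr_fit(K, y, frac=0.02, lam=1e-6):
    """Kernel ridge regression restricted to the top `frac` of eigenpairs of K.

    Returns dual coefficients alpha and the number k of retained eigenpairs.
    Predictions for new points are K_test_train @ alpha.
    """
    n = K.shape[0]
    k = max(1, int(np.ceil(frac * n)))      # keep at least one eigenpair
    evals, evecs = np.linalg.eigh(K)        # eigenvalues in ascending order
    top_vals = evals[-k:]
    top_vecs = evecs[:, -k:]
    # alpha = V_k (Lambda_k + lam I)^{-1} V_k^T y
    alpha = top_vecs @ ((top_vecs.T @ y) / (top_vals + lam))
    return alpha, k

def effective_rank(K):
    """Exponential of the Shannon entropy of the normalized spectrum.

    An assumed richness measure, used here only as an example.
    """
    evals = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    p = evals / evals.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))
```

With `frac=1.0` the fit reduces to ordinary kernel ridge regression, so comparing test error at `frac=0.02` against `frac=1.0` gives a rough analogue of the truncation comparison the abstract reports.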

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 18501
