SymSpectra: Symmetric Information Bottleneck Framework for Molecular Structure Recognition under Imbalanced Settings
Keywords: Molecular structure identification; Data imbalance; Information bottleneck
TL;DR: SymSpectra is a multi-modal spectral learning framework that leverages a conditional information bottleneck to robustly predict molecular structures under data imbalance.
Abstract: Identifying molecular structures from spectral data is essential for early-stage chemical analysis, yet it remains a difficult task due to the imbalance in functional group distributions. Current methods often overfit to prevalent groups while neglecting underrepresented ones, and they fail to capture key dependencies between functional groups. This highlights the need for a unified approach that addresses both data imbalance and structural constraints. In this work, we present \textbf{SymSpectra}, a \textbf{Sym}metric Conditional Information Bottleneck (SCIB) framework designed to integrate multi-modal \textbf{Spectra} features. Our model employs the SCIB framework to fuse multi-modal spectroscopic data into a unified representation, preserving discriminative signals while mitigating redundancy. To enhance robustness against data imbalance, we incorporate conditional mutual information into the training objective, increasing the model’s sensitivity to rare functional groups and challenging molecular cases. Additionally, a specialized module captures the dependencies among functional groups, improving both prediction accuracy and chemical interpretability. Experiments on multi-modal spectral datasets demonstrate that SymSpectra significantly outperforms state-of-the-art methods, achieving an F1-score of 0.970 in substructure classification. More importantly, SymSpectra consistently outperforms baselines under various imbalanced scenarios, exhibiting superior robustness and generalizability, which may help advance the automation of chemical discovery. Our code can be found at \href{https://anonymous.4open.science/r/SymSpectra-0017}{https://anonymous.4open.science}.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 6578