Comparative Performance of EI-MS Spectrum Prediction Models under Data-scarce and Domain-imbalanced Settings
Keywords: Chemoinformatics, Chemical foundation model, EI-MS
Abstract: Electron ionization mass spectrometry (EI-MS) is widely used for chemical identification, yet collecting experimental spectra for newly emerging or domain-specific molecules remains time-consuming and costly. Recent chemical foundation models pre-trained on large-scale molecular corpora offer a promising approach for addressing data scarcity in such settings, but their effectiveness under domain-imbalanced conditions has not been sufficiently examined.
In this study, we compare a conventional MLP-based model (NEIMS) with a fine-tuned chemical foundation model based on MolFormer-XL for EI-MS spectrum prediction under controlled few-shot conditions. Focusing on fluorine-containing molecules as a concrete example of a domain-specific subset, we vary the number of such molecules in the training data while keeping evaluation settings fixed. Across all examined conditions, the MolFormer-based model achieves higher spectral similarity and peak-level precision than NEIMS for fluorine-containing molecules.
These results suggest that molecular representations learned through large-scale pre-training can be leveraged effectively for EI-MS spectrum prediction even when domain-specific training data are sparse. Our findings offer a practical reference point for model selection under domain-imbalanced data conditions in automated material characterization tasks.
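The few-shot protocol described in the abstract (varying the number of fluorine-containing training molecules while the evaluation set stays fixed) and the two kinds of metric it mentions can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not the authors' implementation: the record layout with a hypothetical `has_f` flag, the function names, the use of cosine similarity as the spectral similarity measure, and the top-k overlap used as a stand-in for peak-level precision. Spectra are assumed to be intensity vectors binned on a shared m/z grid.

```python
import math
import random


def cosine_similarity(pred, true):
    """Cosine similarity between two intensity vectors on the same m/z grid."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in true))
    return dot / norm if norm > 0 else 0.0


def top_k_peak_precision(pred, true, k=3):
    """Fraction of the k most intense predicted bins that are also among
    the k most intense reference bins (an assumed definition)."""
    def top_bins(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        return set(order[:k])

    return len(top_bins(pred) & top_bins(true)) / k


def build_few_shot_train(train, n_fluorine, seed=0):
    """Keep all non-fluorinated records and sample at most n_fluorine
    fluorinated ones; the evaluation set is not touched here."""
    rng = random.Random(seed)
    fluoro = [r for r in train if r["has_f"]]      # hypothetical flag
    other = [r for r in train if not r["has_f"]]
    return other + rng.sample(fluoro, min(n_fluorine, len(fluoro)))


if __name__ == "__main__":
    # Toy records; real data would carry structures and measured spectra.
    train = [{"id": i, "has_f": i < 3} for i in range(5)]
    for n in (0, 1, 3):
        subset = build_few_shot_train(train, n)
        print(n, len(subset), sum(r["has_f"] for r in subset))
```

In a comparison of this kind, each model would be trained once per value of `n_fluorine` and scored on the same held-out fluorine-containing molecules, so that differences in the metrics reflect only the amount of in-domain training data.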
Submission Track: Findings, Tools, & Open Challenges (Tiny Paper)
Submission Category: Automated Material Characterization
Submission Number: 58