Keywords: multiple sequence alignment, reaction SMILES, organic chemistry, cheminformatics
Abstract: Multiple sequence alignment (MSA) establishes consistent positional correspondences across biological sequences, enabling comparative analysis and model interpretability. Drawing inspiration from this paradigm, we introduce an unsupervised framework for multiple alignment of chemical reactions represented as atom-mapped SMILES. Each reaction is decomposed into a set of atom-centered SMARTS fragments, which serve as local sequence analogues describing atomic environments and transformations. These SMARTS-derived features are embedded into a global token space whose parameters are optimized through a fast expectation–maximization (EM) procedure. During training, a ranking-based objective learns both the global embedding and positional multipliers that weight atomic contexts, while periodic recanonicalization of the SMARTS templates reorders reactants and token structures according to the evolving embedding. This feedback loop yields a self-consistent embedding that captures reaction role semantics and structural correspondences across the dataset. The final learned embedding is then applied to recanonicalize full reaction SMILES, producing standardized, embedding-guided representations that function analogously to consensus sequences in bioinformatics. The resulting canonicalization enables unsupervised role discovery, reaction alignment, and mechanistically interpretable representations for downstream modeling and generation. A case study illustrates the framework’s ability to recover consistent reactant-role orderings and reaction family structure across diverse transformation classes.
Submission Number: 35
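The final step described in the abstract, reordering reactants in an atom-mapped reaction SMILES according to an embedding, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The helper names (`split_reaction`, `mapped_atoms`, `reorder_reactants`) and the hand-set `token_score` table are hypothetical; in the framework the scores would come from the learned embedding over SMARTS-derived tokens, whereas here each mapped bracket atom is treated as a token and each reactant is ranked by its mean token score.

```python
import re

# Matches bracket atoms that carry an atom-map number, e.g. [CH3:1] -> ("CH3", "1").
MAPPED_ATOM = re.compile(r"\[([^\]:]+):(\d+)\]")


def split_reaction(rxn):
    """Split 'reactants>agents>products' into three lists of molecule SMILES."""
    reactants, agents, products = rxn.split(">")
    return [part.split(".") if part else [] for part in (reactants, agents, products)]


def mapped_atoms(smiles):
    """Return {map_number: atom_token} for every mapped bracket atom."""
    return {int(num): tok for tok, num in MAPPED_ATOM.findall(smiles)}


def reorder_reactants(rxn, token_score):
    """Sort reactants by descending mean token score (ties broken by SMILES text).

    token_score is a stand-in (assumption) for a learned per-token embedding value.
    """
    reactants, agents, products = split_reaction(rxn)

    def key(smi):
        toks = list(mapped_atoms(smi).values())
        mean = sum(token_score.get(t, 0.0) for t in toks) / max(len(toks), 1)
        return (-mean, smi)

    ordered = sorted(reactants, key=key)
    return ">".join(".".join(group) for group in (ordered, agents, products))


# Hypothetical scores; a trained model would supply these values.
token_score = {"C": 1.0, "CH3": 0.2, "OH": -0.5}
rxn = "[OH:3][CH3:4].[CH3:1][C:2](=O)Cl>>[CH3:1][C:2](=O)[O:3][CH3:4]"
canonical = reorder_reactants(rxn, token_score)
```

Because the sort key depends only on each reactant's own tokens, any input ordering of the same reactants yields the same output string, which is the property that lets the reordered SMILES act as a standardized representation.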