NumMolFormer: Enhancing Transformer Numerical Reasoning for Functional-Group-Based Molecule Generation

ICLR 2026 Conference Submission18871 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Structure-based Drug Discovery, Molecule Generation, Transformer
Abstract: Structure-based drug design depends critically on identifying active molecular structures. Functional groups serve as local active centers, and their number must be balanced: too many diminish specificity, while too few limit activity. However, most existing methods model molecules at the atom–bond level rather than at the functional-group level, making it difficult to control the number of functional groups. To address this, we propose NumMolFormer, a molecular generation framework that integrates functional-group knowledge with numerical modeling. NumMolFormer employs a dual-sequence representation that jointly encodes functional-group text tokens and their quantitative information, enhanced by a numerical embedding module that uses symbol–magnitude decomposition and soft magnitude quantization to capture numerical features. Furthermore, we introduce a dual-stream differential attention mechanism to explicitly disentangle textual and numerical contributions. To overcome data scarcity, we build an 18-million-molecule dataset with functional-group annotations for pretraining, followed by self-supervised and RL-based fine-tuning on protein pockets. Experimental results demonstrate that NumMolFormer can effectively control functional groups in molecular generation and produce molecules with enhanced activity, synthesizability, and drug-likeness when conditioned on protein pockets. The code is available at https://github.com/alan-tsang/NumMolFormer.
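The abstract does not spell out the numerical embedding module, so the following is only an illustrative sketch of what a symbol–magnitude decomposition with soft magnitude quantization might look like for a functional-group count. It assumes that "symbol" refers to the sign of the number, and that soft quantization means a softmax-weighted assignment of the magnitude to a set of bin centers. The function names, bin centers, temperature, and embedding tables are all hypothetical and are not taken from the NumMolFormer code.

```python
import math


def sign_magnitude(x):
    """Split a number into its sign (-1, 0, or 1) and its absolute value."""
    sign = 0 if x == 0 else (1 if x > 0 else -1)
    return sign, abs(x)


def soft_quantize(magnitude, centers, temperature=1.0):
    """Softly assign a magnitude to bins (assumed scheme, not the paper's).

    Each bin gets a weight from a softmax over the negative squared
    distance between the magnitude and the bin center.
    """
    logits = [-((magnitude - c) ** 2) / temperature for c in centers]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def embed_number(x, centers, bin_embeddings, sign_embeddings, temperature=1.0):
    """Combine a sign embedding with a soft mixture of bin embeddings."""
    sign, magnitude = sign_magnitude(x)
    weights = soft_quantize(magnitude, centers, temperature)
    dim = len(bin_embeddings[0])
    # Weighted sum of bin embeddings, one coordinate at a time.
    magnitude_vec = [
        sum(w * emb[d] for w, emb in zip(weights, bin_embeddings))
        for d in range(dim)
    ]
    sign_vec = sign_embeddings[sign]
    return [m + s for m, s in zip(magnitude_vec, sign_vec)]


# Hypothetical setup: five bins for counts 0..4 and 2-dimensional embeddings.
centers = [0.0, 1.0, 2.0, 3.0, 4.0]
bin_embeddings = [[float(i), 1.0] for i in range(5)]
sign_embeddings = {-1: [0.0, -1.0], 0: [0.0, 0.0], 1: [0.0, 1.0]}

weights = soft_quantize(2.0, centers, temperature=0.5)
vector = embed_number(2, centers, bin_embeddings, sign_embeddings, temperature=0.5)
```

In this sketch a count of 2 puts most of its weight on the bin centered at 2 while still sharing some weight with neighboring bins, which is one way nearby counts could receive similar embeddings. A trained model would learn the embedding tables rather than fix them by hand.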
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 18871