PairIT: Autoregressive Transformers for Low-Data Molecule Optimization
Keywords: molecular optimization, molecular generation, seeded generation, multi-property optimization, low-data learning, autoregressive transformers
TL;DR: Matched-pair training enables small autoregressive transformers to perform seeded multi-property molecule optimization effectively in low-data regimes.
Abstract: Generating novel molecules with desired properties is a central challenge in drug discovery. Although foundational work has focused on de novo generation, practical use cases require a more complex setting: conditioning on an initial seed molecule, optimizing multiple properties, and operating in data-sparse regimes. We introduce PairIT, a flexible training framework for autoregressive transformers over SMILES sequences that enables seeded generation by learning from pairs of structurally similar molecules. By leveraging matched pairs, the model learns to generate local, chemically meaningful edits that improve a property of the seed molecule. Prompt tokens extend this framework by specifying the target property, desired direction, and approximate magnitude of change, and by allowing multiple property objectives to be composed at inference time. PairIT naturally supports pretraining on a large dataset and fine-tuning on a smaller dataset, which is critical for performance in data-sparse regimes. Surprisingly, we observe diminishing returns from scaling the pretraining dataset. We find, moreover, that reinforcement learning post-training can enhance generation, providing a tunable knob for trading off proximity to the seed against the optimization objective. We demonstrate empirically on ZINC-250K and MoleculeNet that, with appropriate paired-data curation, a standard transformer model of modest size (68M parameters) can achieve robust seeded molecular optimization in low-data regimes.
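The prompt-token scheme described in the abstract can be illustrated with a minimal sketch. It is not PairIT's actual implementation: the token spellings (`<prop:...>`, `<dir:...>`, `<mag:...>`, `<seed>`, `<sep>`, `<eos>`), the magnitude bin edges, the `MatchedPair` container, and the character-level SMILES tokenization are all assumptions made for illustration. The sketch shows how one matched pair might be serialized into a training sequence, and how several property objectives might be composed into a single prompt at inference time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedPair:
    """Hypothetical container for one pair of structurally similar molecules."""
    seed_smiles: str
    target_smiles: str
    property_name: str
    seed_value: float
    target_value: float


def magnitude_bin(delta, edges=(0.5, 1.0, 2.0)):
    """Map |delta| to a coarse bin index (0 = smallest change).

    The bin edges are illustrative assumptions, not values from the paper.
    """
    size = abs(delta)
    for i, edge in enumerate(edges):
        if size < edge:
            return i
    return len(edges)


def prompt_tokens(property_name, delta):
    """Encode target property, direction, and approximate magnitude as tokens."""
    direction = "up" if delta > 0 else "down"
    return [
        f"<prop:{property_name}>",
        f"<dir:{direction}>",
        f"<mag:{magnitude_bin(delta)}>",
    ]


def build_training_sequence(pair):
    """Serialize a matched pair as: prompt tokens, seed, separator, target."""
    delta = pair.target_value - pair.seed_value
    tokens = prompt_tokens(pair.property_name, delta)
    # Character-level tokenization is a simplification: it would split
    # two-character atoms such as "Cl" or "Br".
    tokens += ["<seed>", *pair.seed_smiles, "<sep>", *pair.target_smiles, "<eos>"]
    return tokens


def build_inference_prompt(seed_smiles, objectives):
    """Compose several (property, desired change) objectives into one prompt."""
    tokens = []
    for name, delta in objectives:
        tokens += prompt_tokens(name, delta)
    tokens += ["<seed>", *seed_smiles, "<sep>"]
    return tokens


pair = MatchedPair("CCO", "CCN", "logp", seed_value=1.0, target_value=1.8)
training_sequence = build_training_sequence(pair)
inference_prompt = build_inference_prompt("CCO", [("logp", 1.0), ("qed", -0.2)])
```

In this sketch an autoregressive model would be trained to continue `training_sequence` after `<sep>`, so that at inference time it completes `inference_prompt` with a candidate molecule close to the seed. A practical pipeline would likely use a chemistry-aware SMILES tokenizer in place of the character split.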
Submission Number: 41