TL;DR: A novel framework that stitches molecules from an offline dataset to fine-tune the generative model for offline multi-objective molecular optimization
Abstract: Molecular discovery has attracted significant attention in scientific fields for its ability to generate novel molecules with desirable properties. Although numerous methods have been developed to tackle this problem, most rely on an online setting that requires repeated evaluation of candidate molecules using the oracle. However, in real-world molecular discovery, the oracle is often represented by wet-lab experiments, making this online setting impractical due to the significant time and resource demands. To fill this gap, we propose the Molecular Stitching (MolStitch) framework, which utilizes a fixed offline dataset to explore and optimize molecules without the need for repeated oracle evaluations. Specifically, MolStitch leverages existing molecules from the offline dataset to generate novel "stitched molecules" that combine their desirable properties. These stitched molecules are then used as training samples to fine-tune the generative model using preference optimization techniques. Experimental results on various offline multi-objective molecular optimization problems validate the effectiveness of MolStitch. The source code is available online.
Lay Summary: Scientists are increasingly using artificial intelligence (AI) to help discover new molecules — such as those that could lead to new medicines or advanced materials. In computer simulations, an AI model can generate and evaluate thousands of molecular candidates in just seconds. But in the real world, evaluating even a single molecule often requires wet lab experiments, where chemists or biologists physically synthesize the molecule and test how well it performs. These experiments are expensive and can take weeks or even months to complete.
This slow feedback from wet lab experiments creates a frustrating bottleneck: the AI model suggests promising molecules, then sits idle for weeks or months while chemists conduct physical experiments to test them. During this waiting period, the AI isn't learning or improving, essentially wasting valuable time that could be spent getting better at its job.
In our research, we asked: While waiting for new experimental results, can we make better use of the molecular data we already have? Is it possible to keep training the AI model even without immediate feedback from wet lab evaluations?
To explore this, we developed MolStitch, an offline framework that allows AI to continue learning during these waiting periods. At the core of MolStitch is a proxy model — a model trained on past experimental data that can compare pairs of molecules and predict which one is more promising. This proxy feedback provides ongoing guidance to the AI, allowing it to keep improving even without new wet lab results.
MolStitch also creates new molecules by combining parts of existing molecules — a process we call molecular stitching. These newly generated molecules are then evaluated by the proxy model, which provides valuable feedback to the AI model, teaching it to recognize patterns and propose increasingly better molecules over time.
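The loop described above (stitch existing molecules, compare the results in pairs with a proxy, keep the comparisons as training signal) can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration and is not taken from the MolStitch code: molecules are plain strings rather than valid chemical structures, `stitch` is a simple prefix/suffix join, and `proxy_prefers` is a hypothetical stand-in rule where a trained pairwise proxy model would be used in practice.

```python
import random
from itertools import combinations


def stitch(parent_a, parent_b, rng):
    """Toy stitching (assumption): join a prefix of one parent to a suffix of the other.

    Real molecular stitching would have to respect chemical validity;
    this string operation does not.
    """
    cut_a = rng.randint(1, len(parent_a) - 1)
    cut_b = rng.randint(1, len(parent_b) - 1)
    return parent_a[:cut_a] + parent_b[cut_b:]


def proxy_prefers(mol_a, mol_b):
    """Hypothetical pairwise proxy: True if mol_a is judged more promising.

    Placeholder rule (assumption): prefer the string with more 'N' characters.
    A trained proxy model would replace this function.
    """
    return mol_a.count("N") > mol_b.count("N")


def build_preference_pairs(offline, n_stitched, seed=0):
    """Create stitched candidates and label them as (preferred, rejected) pairs."""
    rng = random.Random(seed)
    stitched = [stitch(*rng.sample(offline, 2), rng) for _ in range(n_stitched)]
    pairs = []
    for a, b in combinations(stitched, 2):
        if proxy_prefers(a, b):
            pairs.append((a, b))
        elif proxy_prefers(b, a):
            pairs.append((b, a))
        # Ties are skipped: they carry no preference signal.
    return stitched, pairs


# Toy "offline dataset" of string molecules (illustrative only).
offline_dataset = ["CCO", "CCN", "NCCN", "CCOC"]
stitched, pairs = build_preference_pairs(offline_dataset, n_stitched=5)
print(len(stitched), "stitched candidates,", len(pairs), "preference pairs")
```

In a full pipeline, pairs like these would serve as the training samples for preference-based fine-tuning of the generative model; that step is omitted here because it depends on the specific model and optimization method.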
In short, MolStitch makes the real-world molecular discovery process more efficient. Instead of the AI sitting idle during long waiting periods, it is constantly learning and improving. When new experimental results finally arrive from the wet lab, the AI is already much better at its job and can suggest the next batch of promising molecular candidates.
Primary Area: Applications->Chemistry, Physics, and Earth Sciences
Keywords: AI for Science, Offline Model-based Optimization, Molecular Optimization
Submission Number: 9317