Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: sample efficiency, generative design, bayesian optimization, molecular design
TL;DR: SEGO couples a language-based generative model with Bayesian optimization over LLM-learned molecular representations, attaining state-of-the-art results on the PMO benchmark with only one tenth of the oracle budget used by other methods.
Abstract: Discovering optimal molecules, whether in drug discovery, materials design, or catalyst optimization, often requires navigating large chemical spaces with very limited data. Two families of methods have emerged: Bayesian optimization (BO), which is highly sample-efficient but typically operates over fixed, user-defined libraries, and goal-directed generative models, which can explore chemical space freely but often require hundreds or thousands of oracle calls to find promising candidates. This creates a practical tension between sample efficiency and chemical space coverage, yet both are essential for real-world campaigns in which the target region of chemical space is unknown and data are expensive to collect. We introduce Sample Efficient Generative Optimization (SEGO), a framework that combines generative modeling with BO for molecular discovery. At each iteration, SEGO uses a surrogate model to focus generation on promising regions of chemical space, then applies BO over the resulting candidates to select the most informative molecules for evaluation. SEGO attains state-of-the-art results on the Practical Molecular Optimization (PMO) benchmark using only one tenth of the oracle calls consumed by other methods, opening the door to optimization campaigns driven by direct experimental feedback.
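The per-iteration loop described in the abstract (surrogate-focused generation, then BO selection over the generated candidates) can be illustrated with a toy sketch. This is not SEGO's implementation: molecules are replaced by points in the unit square, the generative model by random perturbation of already-evaluated points, and the surrogate by a small NumPy Gaussian process with an upper-confidence-bound acquisition. All names (`toy_oracle`, `run_loop`, `pool_size`, `keep`, `beta`) and all of these modeling choices are assumptions made for illustration only.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-3):
    """Posterior mean and standard deviation of a constant-mean GP surrogate."""
    prior_mean = y_train.mean()
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_train, x_query)
    mean = prior_mean + k_star.T @ np.linalg.solve(k, y_train - prior_mean)
    var = 1.0 - (k_star * np.linalg.solve(k, k_star)).sum(axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def toy_oracle(x):
    """Hypothetical expensive objective; its maximum value is 0 at x = 0.7."""
    return -float(((x - 0.7) ** 2).sum())


def run_loop(n_init=5, n_iters=10, dim=2, pool_size=200, keep=50, beta=2.0, seed=0):
    rng = np.random.default_rng(seed)
    x_obs = rng.uniform(0.0, 1.0, size=(n_init, dim))
    y_obs = np.array([toy_oracle(x) for x in x_obs])

    for _ in range(n_iters):
        # Generation (stand-in for a generative model): perturb evaluated points.
        seeds = x_obs[rng.integers(len(x_obs), size=pool_size)]
        pool = np.clip(seeds + 0.2 * rng.normal(size=seeds.shape), 0.0, 1.0)

        # Focus: keep only the candidates the surrogate scores highest.
        mean, _ = gp_posterior(x_obs, y_obs, pool)
        focused = pool[np.argsort(mean)[-keep:]]

        # BO selection: pick the candidate with the best upper confidence bound.
        mean, std = gp_posterior(x_obs, y_obs, focused)
        pick = focused[np.argmax(mean + beta * std)]

        # Spend exactly one oracle call per iteration on the selected candidate.
        x_obs = np.vstack([x_obs, pick])
        y_obs = np.append(y_obs, toy_oracle(pick))

    return x_obs, y_obs


if __name__ == "__main__":
    x_obs, y_obs = run_loop()
    print("oracle calls:", len(y_obs), "best value:", y_obs.max())
```

In this sketch the oracle budget is `n_init + n_iters` calls in total. The surrogate is queried many times per iteration, but the oracle is called only once, which is the sense in which the selection step is sample-efficient.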
Submission Number: 315