TL;DR: We propose a method for selecting algorithms for successful machine learning-guided design, with theoretical guarantees on the distributions of designs produced by the selected algorithms.
Abstract: Algorithms for machine learning-guided design, or *design algorithms*, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task—for example, to design novel proteins with high binding affinity to a therapeutic target—one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for *design algorithm selection*, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion—for example, that at least ten percent of designs’ labels exceed a threshold. It does so by combining designs’ predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference (Angelopoulos et al., 2023). The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method’s effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.
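As a rough illustration of the idea in the abstract, the sketch below combines predictions for designs with held-out labeled data to estimate the fraction of design labels that exceed a threshold, then tests whether that fraction exceeds a user-specified minimum. It is not the authors' implementation. The function names (`pp_success_pvalue`, `select_algorithms`), the normal approximation, and the Bonferroni correction are assumptions made for this example, and the density ratios are assumed to be known.

```python
import math
import numpy as np

def pp_success_pvalue(design_preds, labeled_preds, labeled_labels,
                      density_ratios, threshold, min_fraction):
    """Hypothetical helper: estimate the fraction of design labels above
    `threshold` and test H0: fraction <= min_fraction.

    design_preds:   predictions for designs from one design algorithm
    labeled_preds:  predictions for the held-out labeled points
    labeled_labels: true labels for the held-out labeled points
    density_ratios: p_design(x) / p_labeled(x) at each labeled point
                    (assumed known in this sketch)
    """
    design_preds = np.asarray(design_preds, dtype=float)
    labeled_preds = np.asarray(labeled_preds, dtype=float)
    labeled_labels = np.asarray(labeled_labels, dtype=float)
    density_ratios = np.asarray(density_ratios, dtype=float)

    # Fraction of designs predicted to succeed (may be biased by model error).
    pred_success = (design_preds > threshold).astype(float)

    # Importance-weighted correction for prediction error, measured on
    # held-out labeled data and reweighted toward the design distribution.
    rectifier = density_ratios * (
        (labeled_labels > threshold).astype(float)
        - (labeled_preds > threshold).astype(float)
    )

    estimate = pred_success.mean() + rectifier.mean()
    variance = (pred_success.var(ddof=1) / len(pred_success)
                + rectifier.var(ddof=1) / len(rectifier))
    std_err = math.sqrt(variance)

    if std_err == 0.0:
        # Degenerate case: no sampling variability in either term.
        p_value = 0.0 if estimate > min_fraction else 1.0
    else:
        # One-sided p-value under a normal approximation (an assumption).
        z = (estimate - min_fraction) / std_err
        p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return estimate, p_value

def select_algorithms(p_values, alpha=0.05):
    """Return the names of algorithms whose p-value passes a Bonferroni
    cutoff; an empty list means no algorithm was certified as successful."""
    if not p_values:
        return []
    cutoff = alpha / len(p_values)
    return [name for name, p in p_values.items() if p <= cutoff]

# Usage with synthetic data (all values here are made up for illustration).
rng = np.random.default_rng(0)
labeled_labels = rng.normal(size=500)
labeled_preds = labeled_labels + rng.normal(scale=0.5, size=500)
density_ratios = np.ones(500)  # pretend design and labeled distributions match
design_preds = rng.normal(loc=0.5, size=2000)
estimate, p_value = pp_success_pvalue(
    design_preds, labeled_preds, labeled_labels, density_ratios,
    threshold=1.0, min_fraction=0.1)
selected = select_algorithms({"algorithm_A": p_value}, alpha=0.05)
```

The correction term is what distinguishes this from simply thresholding predictions: it uses labeled data to measure how often the predictor is wrong about success, weighted toward the region where designs lie. With estimated rather than known density ratios, the guarantee described in the abstract would not carry over to this sketch.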
Lay Summary: New molecules, biological sequences, and materials believed to have useful properties are being computationally designed by algorithms. Given the cost of validating these new objects, or *designs*, in the laboratory, how can scientists reliably choose algorithms that will produce successful designs?
We propose a method to help scientists select algorithms that are guaranteed to generate successful pools of designs (or indicate if no algorithm can do so). To assess whether the designs produced by an algorithm will be successful, one could imagine getting predictions of how the designs will behave from a machine-learning system. However, predictions about designs can be fraught with errors, because designs can behave quite differently from the data that such systems are trained on. To overcome this, our approach uses additional data to characterize how prediction error distorts our assessments of whether designs are successful. It then uses statistical tools to undo this distortion and reliably forecast which algorithms will produce successful designs.
Our work offers a framework for how to design novel molecules, biological sequences, or materials in a way that is guided by machine-learning systems, yet not led astray by their errors. As such, it can help scientists increase the probability of success in ambitious design projects.
Primary Area: Applications->Health / Medicine
Keywords: machine learning-guided design, biological sequence design, uncertainty quantification, generative models, model-based optimization, prediction-powered inference, model selection
Submission Number: 5143