Improving real-world sequence design with a simple meta-heuristic for detecting distribution shift

Farhan Damani; David H Brookes; Theodore Sternlieb; Cameron Webster; Stephen Malina; Rishi Jajoo; Kathy Lin; Sam Sinai

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: protein engineering, sequence design, model-based optimization

TL;DR: We show that a particular application of binary classification can filter out-of-distribution sequences in model-based optimization, and we demonstrate the effectiveness of this technique on a real-world problem.

Abstract: Biological sequence design is one of the most impactful applications of model-based optimization (MBO). A common scenario involves training predictive models on a fixed training set, with the goal of designing new sequences that outperform those in the training data. By definition, this results in a distribution shift: the model is applied to samples that differ substantially from those in the training set, since otherwise they would have little chance of being much better. While most MBO methods offer some balancing heuristic to control for false positives, finding the right balance between pushing the design distribution and maintaining model accuracy requires deep knowledge of the algorithm and artful application, which limits successful adoption by practitioners. To tackle this issue, we propose a straightforward meta-algorithm for design practitioners that detects distribution shift when using any MBO method. Through a real-world sequence design experiment, we show that (1) real-world distribution shift is far more severe than that observed in simulated settings, where most MBO algorithms are benchmarked, and (2) our approach successfully reduces the adverse effects of distribution shift. We believe this method can significantly improve design quality for sequence design tasks and potentially for other domains where offline optimization faces severe distribution shift.
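
The abstract and TL;DR describe the method only at a high level: binary classification used to filter out-of-distribution designs. The following is a minimal sketch of one way such a filter could look, not the authors' implementation. It assumes a logistic-regression "domain classifier" trained to separate training sequences (label 0) from designed sequences (label 1), with the predicted probability used as a shift score. The one-hot encoding, the helper names (`fit_shift_classifier`, `shift_score`, `filter_designs`, `mutate`), the threshold, and the toy data are all hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def active_features(seq):
    """Indices of the non-zero entries of a one-hot (position x residue) encoding."""
    return [pos * len(ALPHABET) + ALPHABET.index(aa) for pos, aa in enumerate(seq)]


def _sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_shift_classifier(train_seqs, design_seqs, epochs=300, lr=0.2, l2=0.01):
    """Logistic regression by gradient descent: label 0 = training, 1 = design.

    Assumption: all sequences have the same length.
    """
    data = [(active_features(s), 0.0) for s in train_seqs]
    data += [(active_features(s), 1.0) for s in design_seqs]
    dim = len(train_seqs[0]) * len(ALPHABET)
    w, b = [0.0] * dim, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for idxs, y in data:
            err = _sigmoid(b + sum(w[i] for i in idxs)) - y
            for i in idxs:
                grad_w[i] += err
            grad_b += err
        # Averaged gradient step with L2 shrinkage on the weights.
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def shift_score(model, seq):
    """Predicted probability that seq is a design (higher = more shifted)."""
    w, b = model
    return _sigmoid(b + sum(w[i] for i in active_features(seq)))


def filter_designs(model, design_seqs, threshold=0.5):
    """Keep designs the classifier does not confidently separate from training data."""
    return [s for s in design_seqs if shift_score(model, s) <= threshold]


def mutate(seq, n_mut, rng):
    """Substitute n_mut distinct positions, each with a different residue."""
    chars = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        chars[pos] = rng.choice([a for a in ALPHABET if a != seq[pos]])
    return "".join(chars)


# Toy data (hypothetical): training set and "near" designs are single mutants
# of a wild type; "far" designs are mutated at every position.
rng = random.Random(0)
wild_type = "ACDEFGHIKL"
train = [mutate(wild_type, 1, rng) for _ in range(40)]
near = [mutate(wild_type, 1, rng) for _ in range(10)]
far = [mutate(wild_type, 10, rng) for _ in range(10)]

model = fit_shift_classifier(train, near + far)
kept = filter_designs(model, near + far)
```

In this sketch the classifier is scored on the same designs it was fit on, which is optimistic; a held-out split or cross-validation would give a more honest shift estimate. The threshold of 0.5 is arbitrary and would need tuning for any real design campaign.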

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 11130
