LLM-Guided Retrieval for Prediction of Molecular Perturbation Responses

Published: 02 Mar 2026, Last Modified: 08 May 2026 · MLGenX 2026 Poster · CC BY 4.0
Abstract: Predicting transcriptional responses to small-molecule perturbations is one of the main challenges in drug discovery, yet exhaustive profiling of drugs across cell lines is infeasible. While there have been many attempts to frame this task as supervised prediction, recent work shows that simple aggregation baselines remain surprisingly strong, highlighting the importance of generalization over model complexity. In this work, we adopt a retrieval-based perspective on molecular perturbation prediction: instead of directly predicting high-dimensional gene expression, we introduce LLM-Guided Retrieval (LGR), a framework in which a large language model acts as a selector to identify biologically relevant perturbations, whose observed transcriptional responses are then aggregated by a downstream predictor. This design leverages the biological reasoning capabilities of LLMs while avoiding their limitations in numerical generation. We evaluate LGR on large-scale single-cell perturbation data under closed-world and open-world regimes, including challenging settings with unseen drugs and unseen cell types. LGR achieves the strongest performance in unseen cell-line generalization and remains competitive with strong cell-mean baselines in open-world scenarios. Notably, LGR substantially improves the directional (sign) accuracy of gene regulation, indicating better recovery of biologically meaningful perturbation effects even when magnitude-based metrics are comparable. These results establish LLM-guided retrieval as a simple and interpretable baseline for molecular perturbation prediction, particularly in settings characterized by unseen cellular contexts and limited training data.
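The retrieve-then-aggregate design described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`lgr_predict`, `mock_selector`), the mean aggregator, and the keyword-overlap stand-in for the LLM selector are all assumptions made for illustration.

```python
# Hypothetical sketch of LLM-Guided Retrieval (LGR): an external selector
# picks biologically relevant perturbations, and a downstream predictor
# aggregates their observed expression responses. All names and the
# aggregation rule here are illustrative assumptions.
import numpy as np

def lgr_predict(query_drug, candidate_drugs, responses, select_fn, k=3):
    """Predict a perturbation response by aggregating retrieved neighbors.

    query_drug:      identifier of the unseen perturbation
    candidate_drugs: drugs with observed transcriptional responses
    responses:       dict mapping drug -> expression-change vector
    select_fn:       stand-in for the LLM selector; returns the k drugs
                     it judges most biologically relevant to query_drug
    """
    retrieved = select_fn(query_drug, candidate_drugs, k)
    # Downstream predictor: simple mean aggregation of retrieved responses,
    # mirroring the strong cell-mean-style baselines the abstract mentions.
    return np.mean([responses[d] for d in retrieved], axis=0)

# Toy keyword-overlap selector standing in for the LLM's biological reasoning.
def mock_selector(query, candidates, k):
    overlap = lambda d: len(set(query.split("-")) & set(d.split("-")))
    return sorted(candidates, key=overlap, reverse=True)[:k]

responses = {
    "egfr-inhibitor-a": np.array([1.0, -0.5, 0.2]),
    "egfr-inhibitor-b": np.array([0.8, -0.4, 0.0]),
    "hdac-inhibitor":   np.array([-0.3, 0.9, -0.1]),
}
pred = lgr_predict("egfr-inhibitor-c", list(responses), responses,
                   mock_selector, k=2)
```

In this toy run the selector retrieves the two EGFR inhibitors, so the prediction is the mean of their response vectors; a real LGR system would instead query an LLM with drug and cell-line descriptions to rank candidates.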
Submission Number: 63