Keywords: Large Language Models, Model Interpretability, Evolutionary Algorithms, Program Synthesis, Regulatory Genomics
Abstract: Deep learning models have achieved state-of-the-art performance on complex regulatory prediction tasks, yet their black-box nature often limits the discovery of new biological insights. Here we present LLMGEN, a framework that leverages large language models (LLMs) and evolutionary algorithms to automate the discovery of compact, human-readable symbolic programs. LLMGEN integrates multiple modalities (biological sequences, functional readouts, natural language descriptions, and executable Python functions) to bridge complex genomic models with interpretable rules. Inspired by recent program-synthesis frameworks such as FunSearch and AlphaEvolve, LLMGEN adapts LLM-guided program evolution to the genomic domain and seeds the search with biologically relevant attribution features as priors to improve convergence. Across datasets including CRISPRi screens, STARR-seq enhancer assays, and ATAC-seq chromatin accessibility profiles, LLMGEN evolves concise prediction rules that are competitive with deep learning models, rediscovers known motifs and interactions, and generates testable mechanistic hypotheses. These results demonstrate that LLM-guided program evolution is a flexible, model-agnostic approach for building interpretable genomic predictors, advancing multi-modal foundation models toward trustworthy and transparent AI in the life sciences.
Submission Number: 64