GLM-Prior: A Genomic Language Model for Sequence-Derived Prior Knowledge in GRN Inference

Published: 06 Oct 2025, Last Modified: 06 Oct 2025
NeurIPS 2025 2nd Workshop FM4LS Poster · CC BY 4.0
Keywords: Foundation Models, Finetuning, Genomic Language Model, Gene Regulatory Network Inference, Priors
TL;DR: GLM-Prior: a method for fine-tuning a pretrained foundation model to predict regulatory interactions between transcription factors and target genes for downstream GRN inference.
Abstract: Gene regulatory network (GRN) inference relies on high-quality prior knowledge, which is often incomplete or unavailable, particularly for complex organisms and diverse cell types. We present GLM-Prior, a genomic language model that fine-tunes the pretrained Nucleotide Transformer to learn transcription factor (TF) to target gene regulatory interactions from nucleotide sequence, yielding a sequence-derived prior for downstream GRN inference. In yeast, GLM-Prior outperforms motif-based and curated prior knowledge. When trained on general interaction data in human or mouse, GLM-Prior recovers cell line-specific regulatory structure and enables zero-shot transfer between species. Across settings, adding expression-based inference provides only modest improvements, indicating that most recoverable regulatory structure is captured by the sequence features learned by GLM-Prior. These results support sequence-derived prior knowledge as a strong basis for GRN inference, with expression data used primarily to refine and contextualize a fixed regulatory scaffold.
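The pipeline the abstract describes — scoring TF–target gene pairs from sequence and assembling those scores into a prior for downstream GRN inference — can be sketched as follows. This is a minimal illustration, not the paper's method: the scoring function is a toy stand-in (k-mer overlap between a TF motif and a target promoter) in place of the fine-tuned Nucleotide Transformer, and all names (`score_pair`, `build_prior`, the example motifs and promoters) are hypothetical.

```python
from itertools import product

def kmers(seq, k=3):
    """Set of overlapping k-mers in a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def score_pair(tf_motif, promoter, k=3):
    """Toy stand-in for a fine-tuned sequence model: Jaccard overlap
    between the k-mer sets of a TF motif and a target promoter."""
    a, b = kmers(tf_motif, k), kmers(promoter, k)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_prior(tf_motifs, promoters, threshold=0.1):
    """Score every TF x gene pair and keep edges above a threshold,
    yielding a sparse sequence-derived prior for GRN inference."""
    prior = {}
    for (tf, motif), (gene, prom) in product(tf_motifs.items(), promoters.items()):
        s = score_pair(motif, prom)
        if s >= threshold:
            prior[(tf, gene)] = s
    return prior

# Hypothetical inputs: TF binding motifs and target promoter sequences.
tf_motifs = {"TF1": "TGACTCA", "TF2": "CACGTG"}
promoters = {"geneA": "AATGACTCAGG", "geneB": "TTCACGTGAA"}
prior = build_prior(tf_motifs, promoters)
```

In the actual system, `score_pair` would be replaced by the fine-tuned model's predicted interaction probability for the pair, and the resulting prior would be handed to an expression-based GRN inference method as its regulatory scaffold.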
Submission Number: 3