Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders

Gongxu Luo; Haoyue Dai; Loka Li; Chengqian Gao; Boyang Sun; Kun Zhang

Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders

Gongxu Luo, Haoyue Dai, Loka Li, Chengqian Gao, Boyang Sun, Kun Zhang

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 posterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Selection bias; Causal discovery; latent confounders; Gene regulatory newwork inference

Abstract: Gene regulatory network inference (GRNI) aims to discover how genes causally regulate each other from gene expression data. It is well-known that statistical dependencies in observed data do not necessarily imply causation, as spurious dependencies may arise from *latent confounders*, such as non-coding RNAs. Numerous GRNI methods have thus been proposed to address this confounding issue. However, dependencies may also result from *selection*—only cells satisfying certain survival or inclusion criteria are observed—while these selection-induced spurious dependencies are frequently overlooked in gene expression data analyses. In this work, we show that such selection is ubiquitous and, when ignored or conflated with true regulations, can lead to flawed causal interpretation and misguided intervention recommendations. To address this challenge, a fundamental question arises: can we distinguish dependencies due to regulation, confounding, and crucially, selection? We show that gene perturbations offer a simple yet effective answer: selection-induced dependencies are *symmetric* under perturbation, while those from regulation or confounding are not. Building on this motivation, we propose GISL (Gene regulatory network Inference in the presence of Selection bias and Latent confounders), a principled algorithm that leverages perturbation data to uncover both true gene regulatory relations and non-regulatory mechanisms of selection and confounding up to the equivalence class. Experiments on synthetic and real-world gene expression data demonstrate the effectiveness of our method.

Supplementary Material: zip

Primary Area: Machine learning for sciences (e.g. climate, health, life sciences, physics, social sciences)

Submission Number: 17977

Loading