Keywords: gene expression prediction, epigenomics, multimodal, causal learning
Abstract: Predicting gene expression by integrating DNA sequences with multimodal epigenomic signals remains challenging. Previous methods often use epigenomic signals to locate distal enhancers and incorporate those enhancers into the model through long-sequence modeling. Our experiments reveal that current long-sequence modeling actually decreases performance, while proximal signals near target genes prove more essential. Furthermore, we find that different signals contribute different amounts of performance gain, and naively using all epigenomic signals may lead models to depend excessively on widespread background signals. These background signals act as confounders, causing the model to learn spurious dependencies. To overcome these issues, we propose InFER, which applies causal intervention through backdoor adjustment to remove the model's dependence on potentially confounding background epigenomic regulation. Our experimental results show that properly modeling epigenomic regulation with short sequences alone can achieve state-of-the-art performance in gene expression prediction.
Submission Number: 10