ERVNet: A Three-Module Framework for Predicting Endogenous Retrovirus Reactivation, Gene Propagation, and Immunogenicity

Published: 28 May 2026, Last Modified: 28 May 2026 | ICML 2026 FM4LS Workshop Poster | CC BY 4.0
Keywords: endogenous retroviruses, multi-resolution encoding, epigenomics, multimodal deep learning, chromatin state prediction, graph neural networks, immunotherapy target discovery, transposable elements, histone modifications, cancer genomics
TL;DR: ERVNet is the first multimodal framework to predict ERV reactivation from chromatin state, using resolution-matched histone encoding (MREE), Hi-C graph propagation (CASCADE), and immunogenicity scoring (MIMIC) across 725K loci and 25 cell types.
Abstract: Endogenous retroviruses (ERVs) comprise 8% of the human genome across ~725K loci and can reactivate in cancer, triggering potent antitumor immunity via viral mimicry—yet all existing tools quantify ERV expression post-hoc from RNA-seq, and none predict which loci will reactivate prospectively from chromatin state. This gap limits rational immunotherapy target selection, as the vast locus space makes experimental screening intractable without computational prioritization. We present ERVNet, a three-module multimodal pipeline that bridges this gap. Its core innovation, Multi-Resolution Epigenomic Encoding (MREE), tokenizes each histone mark at its characteristic domain scale—75bp for narrow activation peaks like H3K27ac, 225bp for broad repressive domains like H3K9me3—rather than forcing all marks through a single resolution. CASCADE then propagates predicted activation through real Hi-C chromatin loops via a heterogeneous GNN to model downstream gene dysregulation, and MIMIC classifies ERV subfamily immunogenicity from sequence-derived features validated against real DNMTi perturbation data. Resolution-matched encoding improves AUPRC by 16.4% over single-resolution baselines, with MREE achieving LOCO-CV AUROC 0.984 across 25 cell types and cross-dataset AUROC 0.969 on independently generated ENCODE data. CASCADE captures gene-level consequences at Pearson r=0.422, while MIMIC attains AUROC 0.749 with permutation p<0.002 and external validation ρ=0.118. Gradient attribution reveals why the multi-resolution design works: H3K9me3 at coarse resolution provides the dominant discriminative signal, capturing 10–50kb heterochromatin domain boundaries that uniform encoding fragments—a finding that reconciles the apparent paradox between dropout analysis (H3K27ac dominant) and saliency mapping (H3K9me3 dominant). 
Applying the full pipeline across 25 cell types identifies 336 cancer-specific ERV loci with 1.43× enrichment over somatic tissues, yielding a tractable candidate set for experimental immunotherapy validation. More broadly, the principle of encoding each genomic feature at its characteristic spatial scale may generalize to other multimodal epigenomic prediction tasks where input signals span diverse resolutions.
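As a rough illustration of the resolution-matched encoding idea described in the abstract, the sketch below bins a per-base signal track at a mark-specific width (75 bp for H3K27ac, 225 bp for H3K9me3, as stated above). Everything else is an assumption made for illustration only: the function names, the use of mean pooling to summarize each bin, and the 900 bp window are hypothetical and are not taken from the ERVNet implementation.

```python
import numpy as np

# Bin widths in base pairs as given in the abstract; the dict is illustrative.
BIN_WIDTH_BP = {"H3K27ac": 75, "H3K9me3": 225}


def tokenize_track(signal, bin_bp):
    """Mean-pool a per-base signal into non-overlapping bins of `bin_bp` bases.

    Mean pooling is an assumption; the abstract does not say how each
    token is computed from the underlying signal.
    """
    signal = np.asarray(signal, dtype=float)
    n_bins = len(signal) // bin_bp  # any trailing partial bin is dropped
    return signal[: n_bins * bin_bp].reshape(n_bins, bin_bp).mean(axis=1)


def encode_locus(tracks):
    """Tokenize each histone mark at its own bin width (hypothetical helper)."""
    return {
        mark: tokenize_track(sig, BIN_WIDTH_BP[mark])
        for mark, sig in tracks.items()
    }


# Hypothetical 900 bp window around one locus, with synthetic signal values.
tokens = encode_locus({
    "H3K27ac": np.ones(900),
    "H3K9me3": np.arange(900),
})
print({mark: t.shape for mark, t in tokens.items()})
```

In this toy setup the same 900 bp window yields 12 tokens for H3K27ac and 4 for H3K9me3, so each mark contributes a sequence whose length reflects its own spatial scale rather than a shared uniform resolution.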
Submission Number: 130