Interpretable cancer driver gene prediction from DNA sequences using a genomic language model

ICLR 2025 Workshop LMRL Submission 72 Authors

12 Feb 2025 (modified: 18 Apr 2025)
Submitted to: ICLR 2025 Workshop LMRL
License: CC BY 4.0
Track: Tiny Paper Track
Keywords: genomic language model, cancer driver gene, post-hoc explainability
TL;DR: A new interpretable method to predict cancer driver genes from raw DNA sequences based on a genomic language model
Abstract: Cancer driver genes are usually detected as positively selected genes with high fitness. We hypothesize that subsequences within these genes carry signals of positive selection that can be learned by genomic language models (gLMs). In this work, we fine-tuned Caduceus, a high-performing long-range gLM, to predict cancer driver genes from DNA sequences. Post-hoc interpretations of our fine-tuned model helped to explain important sequence features associated with gene fitness such as known somatic mutations in driver genes. Our approach generates meaningful representations of DNA sequences related to cancer driver genes and provides a framework toward interpretable cancer driver gene prediction.
Submission Number: 72