Track: Tiny Paper Track
Keywords: genomic language model, cancer driver gene, post-hoc explainability
TL;DR: A new interpretable method, based on a genomic language model, for predicting cancer driver genes from raw DNA sequences
Abstract: Cancer driver genes are usually identified as genes under positive selection, whose mutations confer a fitness advantage on tumor cells. We hypothesize
that subsequences within these genes carry signals of positive selection that can be learned by
genomic language models (gLMs). In this work, we fine-tuned Caduceus, a high-performing long-range
gLM, to predict cancer driver genes from DNA sequences. Post-hoc interpretation of the
fine-tuned model helped identify sequence features associated with gene fitness, such
as known somatic mutations in driver genes. Our approach generates meaningful representations of
DNA sequences related to cancer driver genes and provides a framework toward interpretable cancer
driver gene prediction.
Submission Number: 72