Keywords: genomic language models, knowledge distillation, DNA, genetics, biology
TL;DR: Integrating functional genomic annotations enhances genomic language model representations, enabling effective distillation to a lightweight, annotation-free model.
Abstract: Genomic language models (GLMs) learn contextual representations of DNA sequences, but current approaches rely solely on sequence patterns and do not incorporate known genomic and functional annotations. To address this, we present annDNA, a two-stage framework: (1) annotation-aware pre-training, which creates tokens that explicitly encode functional information from GENCODE and ENCODE, and (2) cross-modal knowledge distillation, which transfers these annotation-aware representations to a sequence-only model. In variant effect prediction, annotation-aware models achieve 15.5% higher AUROC than sequence-only baselines. The distilled model, with one-third of the parameters, achieves an 11.2% improvement over the sequence-only baseline while requiring only sequence input at inference. Our results demonstrate the value of using annotations during training and offer a general framework for transferring biological knowledge to sequence-only models.
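As a rough illustration of stage (2), the sketch below computes a representation-matching loss between a teacher (annotation-aware) and a student (sequence-only) model. The abstract does not specify the distillation objective, so the mean-squared-error loss, the function name `mse_distillation_loss`, and the toy embeddings are all assumptions made for illustration, not the annDNA implementation.

```python
def mse_distillation_loss(student, teacher):
    """Mean squared error between per-token student and teacher embeddings.

    Both arguments are lists of equal-length float vectors, one per token.
    MSE is an assumed objective; the abstract does not name the loss used.
    """
    if not student or len(student) != len(teacher):
        raise ValueError("student and teacher must cover the same, non-zero number of tokens")
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        if len(s_vec) != len(t_vec):
            raise ValueError("embedding dimensions must match")
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("embeddings must not be empty")
    return total / count


# Hypothetical toy embeddings: two tokens, two dimensions each.
teacher_repr = [[1.0, 0.0], [0.0, 1.0]]   # annotation-aware teacher output
student_repr = [[0.5, 0.0], [0.0, 1.0]]   # sequence-only student output
loss = mse_distillation_loss(student_repr, teacher_repr)
```

In a setup like this, the teacher would see both sequence and annotation tokens during training, while the student sees sequence alone; minimizing such a loss pushes the student's representations toward the teacher's, so annotations are no longer needed at inference.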
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 49