Keywords: DNA classification, Large Language Model, DNA Foundation Model, Byte-level LLM, Distillation
TL;DR: We show that DNA-specific models and LLMs capture complementary signals, and safely unify them through confidence-guided distillation to enhance DNA classification.
Abstract: DNA sequence modeling has advanced with specialized foundation models such as HyenaDNA, yet these models capture only partial genomic cues. In this work, we investigate whether large language models (LLMs)—both subword-tokenized (LLaMA) and byte-level (EvaByte)—provide complementary perspectives when applied to DNA classification. Through experiments on the Human Enhancer Cohn benchmark, we find that DNA-pretrained models and LLMs succeed on largely disjoint subsets of data, revealing genuine cross-family complementarity. Building on this insight, we propose a confidence-guided distillation framework that aggregates supervision only from correct and confident teachers, producing soft labels that safely transfer diverse knowledge. Our method consistently improves both compact DNA-specific models and large byte-level LLMs, achieving gains of up to +1.90 accuracy points while remaining robust against overfitting even under near-perfect training accuracy. These findings highlight that DNA and language models encode orthogonal yet synergistic representations, and that principled distillation can unify them into a single model for robust genomic prediction.
Submission Number: 84
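The abstract describes aggregating supervision only from teachers that are both correct and confident on a given example. Below is a minimal sketch of what such confidence-guided soft-label construction could look like; it is not the authors' implementation, and the function name, the confidence threshold, and the one-hot fallback when no teacher qualifies are all assumptions made for illustration.

```python
# Minimal sketch (not the paper's code) of confidence-guided soft-label
# aggregation: for each training example, keep only teachers that predict
# the true label with probability above a threshold, then average their
# output distributions into a soft target for the student.
import numpy as np

def confidence_guided_soft_labels(teacher_probs, labels, conf_threshold=0.7):
    """
    teacher_probs: array of shape (num_teachers, num_examples, num_classes)
                   holding each teacher's softmax outputs.
    labels:        array of shape (num_examples,) with ground-truth class ids.
    Returns soft targets of shape (num_examples, num_classes).
    """
    num_teachers, num_examples, num_classes = teacher_probs.shape
    soft_targets = np.zeros((num_examples, num_classes))

    for i in range(num_examples):
        y = labels[i]
        # Keep teachers that are both correct and confident on this example.
        kept = [
            teacher_probs[t, i]
            for t in range(num_teachers)
            if teacher_probs[t, i].argmax() == y
            and teacher_probs[t, i][y] >= conf_threshold
        ]
        if kept:
            # Average the surviving teachers' distributions.
            soft_targets[i] = np.mean(kept, axis=0)
        else:
            # Assumed fallback: plain one-hot label when no teacher qualifies.
            soft_targets[i, y] = 1.0
    return soft_targets


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical teachers (e.g. a DNA model and a byte-level LLM),
    # four examples, binary enhancer classification.
    probs = rng.dirichlet(np.ones(2), size=(2, 4))
    labels = np.array([0, 1, 1, 0])
    print(confidence_guided_soft_labels(probs, labels))
```

The resulting soft targets would then be used as distillation targets for the student (e.g. via a cross-entropy or KL term), so that only knowledge from trustworthy teachers is transferred on each example.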