HAD: Hybrid Architecture Distillation for Bridging Large-Transformer Knowledge into Compact Genomic Models
Keywords: Genomic Language Model, Knowledge Distillation, Foundation Model
Abstract: Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and downstream fine-tuning has also achieved remarkable progress in the field of genomic sequence modeling.
However, existing research often either relies on scaling up pre-training data and parameters, which incurs a heavy computational burden, or lacks a systematic way to preserve prior knowledge when adopting compact architectures.
In this work, we propose a **H**ybrid **A**rchitecture **D**istillation (**HAD**) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training.
Specifically, we employ NTv2-500M as the teacher model and devise a grouping masking strategy that aligns the feature embeddings of visible tokens while concurrently reconstructing the masked (invisible) tokens during MLM pre-training.
To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and the Genomic Benchmark. Compared to models with a similar parameter count, our model achieved excellent performance. **More surprisingly**, it even surpassed the teacher model, its nominal distillation ceiling, on some sub-tasks, despite the teacher being more than **500×** larger.
Lastly, we conducted a comprehensive analysis of the HAD architecture, including a linear-probing evaluation of the learned representations, which demonstrates both the strong representation capacity of HAD and the validity of our teacher model selection for distillation. t-SNE visualization further supports these findings, providing an intuitive view of the model's representation ability.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 20412
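The abstract describes a hybrid objective that distills teacher features at visible token positions while reconstructing masked positions via MLM. Below is a minimal PyTorch-style sketch of such a combined loss; the function name, the projection head `proj`, and the weights `alpha`/`beta` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_distillation_loss(student_hidden, teacher_hidden, mlm_logits,
                             target_ids, visible_mask, masked_mask,
                             proj, alpha=1.0, beta=1.0):
    """Hypothetical sketch of a hybrid pre-training objective:
    feature distillation on visible tokens plus MLM reconstruction
    on masked tokens. Shapes: hidden states (B, L, d), logits (B, L, V),
    masks (B, L) boolean."""
    # Project student features to the teacher's hidden size before alignment.
    aligned = proj(student_hidden)                       # (B, L, d_teacher)

    # Distillation term: match teacher embeddings only at visible positions.
    vis = visible_mask.unsqueeze(-1)                     # (B, L, 1)
    distill_loss = F.mse_loss(aligned[vis.expand_as(aligned)],
                              teacher_hidden[vis.expand_as(teacher_hidden)])

    # Reconstruction term: standard MLM cross-entropy on masked positions.
    mlm_loss = F.cross_entropy(mlm_logits[masked_mask],  # (N_masked, V)
                               target_ids[masked_mask])  # (N_masked,)

    # Weighted sum of the two terms; weights are illustrative.
    return alpha * distill_loss + beta * mlm_loss
```

In this sketch, the teacher's hidden states are computed once per batch with gradients disabled, and only the student and the projection head receive updates.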