HAD: Hybrid Architecture Distillation for Bridging Large-Transformer Knowledge into Compact Genomic Models
Keywords: Genomic Language Model, Knowledge Distillation, Foundation Model
Abstract: Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and downstream fine-tuning has also achieved remarkable progress in the field of genomic sequence modeling.
However, existing research often either relies on scaling up pre-training data and parameters, which incurs a heavy computational burden, or lacks a systematic way to preserve prior knowledge when adopting compact architectures.
In this work, we propose a **H**ybrid **A**rchitecture **D**istillation (**HAD**) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training.
Specifically, we employ NTv2-500M as the teacher model and devise a grouping masking strategy that aligns the feature embeddings of visible tokens while concurrently reconstructing the masked (invisible) tokens during MLM pre-training.
To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and the Genomic Benchmark. Compared to models with a similar parameter count, our model achieved excellent performance. **More surprisingly**, it even surpassed the teacher model, its nominal distillation ceiling, on some sub-tasks, despite the teacher being more than **500×** larger.
Lastly, we conducted a comprehensive analysis of the HAD architecture, including a linear-probing evaluation of the learned representations, which demonstrates both the strong representation capacity of HAD and the validity of our teacher model selection for distillation. t-SNE visualization further supports these findings, providing an intuitive view of the model's representation ability.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 20412
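The abstract describes a hybrid objective that distills teacher features at visible token positions while reconstructing masked positions via MLM. Below is a minimal PyTorch-style sketch of such a combined loss; the function name, the projection head `proj`, and the weights `alpha`/`beta` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_distillation_loss(student_hidden, teacher_hidden, mlm_logits,
                             target_ids, visible_mask, masked_mask,
                             proj, alpha=1.0, beta=1.0):
    """Hypothetical sketch of a hybrid pre-training objective:
    feature distillation on visible tokens plus MLM reconstruction
    on masked tokens. Shapes: hidden states (B, L, d), logits (B, L, V),
    masks (B, L) boolean."""
    # Project student features to the teacher's hidden size before alignment.
    aligned = proj(student_hidden)                       # (B, L, d_teacher)

    # Distillation term: match teacher embeddings only at visible positions.
    vis = visible_mask.unsqueeze(-1)                     # (B, L, 1)
    distill_loss = F.mse_loss(aligned[vis.expand_as(aligned)],
                              teacher_hidden[vis.expand_as(teacher_hidden)])

    # Reconstruction term: standard MLM cross-entropy on masked positions.
    mlm_loss = F.cross_entropy(mlm_logits[masked_mask],  # (N_masked, V)
                               target_ids[masked_mask])  # (N_masked,)

    # Weighted sum of the two terms; weights are illustrative.
    return alpha * distill_loss + beta * mlm_loss
```

In this sketch, the teacher's hidden states are computed once per batch with gradients disabled, and only the student and the projection head receive updates.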