Keywords: Learngene, Low-rank adapters for fine-tuning, Weight Sharing, Quantization
TL;DR: Use Repeatable Low-bit Learngene to efficiently construct variable-depth Vision Transformers for downstream tasks.
Abstract: Large-scale Vision Transformers (ViTs) have achieved remarkable success across a wide range of computer vision tasks.
However, fine-tuning and deploying them in diverse real-world scenarios remains challenging, as resource constraints demand models of different scales. The recently proposed Learngene paradigm mitigates this issue by extracting compact, transferable modules from well-trained ancestor models to initialize descendant models of varying scales. Yet existing Learngene methods mainly treat learngenes as initialization modules for descendant models, without addressing how to construct these models more efficiently. In this work, we rethink the Learngene methodology from the perspectives of quantization and parameter repetition. We introduce Repeatable Low-bit Learngene (RELL), which compresses ancestor knowledge into a small set of quantized, cross-layer shared modules via quantization-aware training and knowledge distillation. These repeatable low-bit modules enable the flexible construction of descendant models of varying depths through parameter replication, while requiring only lightweight adapter tuning for downstream adaptation. Extensive experiments demonstrate that RELL achieves superior parameter efficiency and competitive or better performance compared with existing Learngene methods.
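To make the construction concrete, the sketch below (not from the paper; the module names, dimensions, 4-bit fake-quantization scheme, and adapter design are all illustrative assumptions) shows how a single low-bit transformer block could be reused at every depth position, with a small trainable adapter at each position, to form descendant models of different depths.

```python
# Minimal sketch (assumed, not the authors' implementation): build a
# variable-depth descendant model by reusing one shared, low-bit
# "learngene" transformer block and attaching a lightweight adapter
# at each depth position. Only the adapters and head are trained downstream.
import torch
import torch.nn as nn


def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform fake-quantization with a straight-through estimator."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward uses w_q, gradients flow through w


class LowBitLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized in the forward pass."""
    def __init__(self, in_f, out_f, num_bits=4):
        super().__init__(in_f, out_f)
        self.num_bits = num_bits

    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight, self.num_bits), self.bias)


class SharedLearngeneBlock(nn.Module):
    """One low-bit transformer block, shared across every depth of the descendant."""
    def __init__(self, dim=384, heads=6, mlp_ratio=4, num_bits=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            LowBitLinear(dim, dim * mlp_ratio, num_bits), nn.GELU(),
            LowBitLinear(dim * mlp_ratio, dim, num_bits),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))


class Adapter(nn.Module):
    """Lightweight bottleneck adapter, trained per depth position."""
    def __init__(self, dim=384, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))


class DescendantViT(nn.Module):
    """Descendant of arbitrary depth: the shared block is frozen and replicated
    across depth; only the adapters and the classification head are tuned."""
    def __init__(self, shared_block: SharedLearngeneBlock, depth: int,
                 dim=384, num_classes=100):
        super().__init__()
        self.shared_block = shared_block
        for p in self.shared_block.parameters():
            p.requires_grad = False
        self.adapters = nn.ModuleList(Adapter(dim) for _ in range(depth))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                    # tokens: (B, N, dim), after patch embedding
        for adapter in self.adapters:
            tokens = adapter(self.shared_block(tokens))
        return self.head(tokens.mean(dim=1))      # mean-pooled classification


# Usage: one learngene block serves descendants of different depths.
learngene = SharedLearngeneBlock()
shallow = DescendantViT(learngene, depth=6)
deep = DescendantViT(learngene, depth=12)
logits = shallow(torch.randn(2, 197, 384))        # -> shape (2, 100)
```

The patch embedding, distillation from the ancestor, and quantization-aware training loop are omitted here; the sketch only illustrates the weight-sharing-by-repetition and adapter-tuning structure described in the abstract.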
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6723