Keywords: PEFT, Adapter, transfer learning, ViT, computer vision
TL;DR: ELSE introduces a theoretically grounded, ultra-lightweight Adapter for Vision Transformers, achieving efficient and stable fine-tuning by leveraging zero-initialization and training dynamics analysis.
Abstract: Inspired by parameter-efficient fine-tuning (PEFT) methods in natural language processing, numerous efforts have sought lightweight plug-in modules that adapt Vision Transformers (ViTs) to downstream applications. However, most of these endeavors are motivated from the architecture-design point of view and neglect the training dynamics of fine-tuning in terms of efficiency and stability. In contrast, this study investigates how fine-tuning should inform architecture design, and derives a lightweight module through a theoretical and experimental analysis of fine-tuning. We observe that the parameter initialization of fine-tuning has a significant influence on training stability and efficiency. In particular, initializing all of the Adapter's parameters to zero lets fine-tuning start approximately from the original ViT with the desired training dynamics, owing to the universal representations that ViT learns from large-scale data. Our theoretical deduction further shows that this initialization causes gradient vanishing in the Adapter, leaving a large portion of it inactive; this opens the opportunity to simplify the Adapter into an extremely lightweight equivalent form, namely a single learnable vector in place of the full Adapter. We thus arrive at an Extremely Lightweight adapter of Simple Expression (ELSE). In experiments, ELSE achieves superior or comparable transfer-learning performance while fine-tuning less than 0.07% of the model's parameters and retaining plug-and-play flexibility.
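The abstract's chain of reasoning (zero-initialized Adapter → identity mapping at the start of fine-tuning → vanishing gradients for the down-projection → collapse to a single learnable vector) can be illustrated with a minimal NumPy sketch. The bottleneck-Adapter form, the ReLU activation, and the additive placement of the ELSE vector are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # hidden dim and bottleneck dim (illustrative sizes)

# Standard bottleneck Adapter with all parameters zero-initialized.
W_down = np.zeros((r, d))
W_up = np.zeros((d, r))

h = rng.standard_normal(d)          # a token feature from the frozen ViT
z = np.maximum(W_down @ h, 0.0)     # down-projection + ReLU (assumed activation)
out = h + W_up @ z                  # residual adapter output

# At zero init the adapter is exactly the identity:
# fine-tuning starts from the original ViT, as the abstract argues.
assert np.allclose(out, h)

# Chain rule for dL/dW_down: the signal is pre-multiplied by W_up^T,
# which is zero, so the down-projection receives no gradient at this
# initialization -- the bulk of the Adapter is inactive.
g_out = rng.standard_normal(d)      # stand-in for dL/d(out)
relu_mask = (W_down @ h > 0).astype(float)
g_W_down = np.outer((W_up.T @ g_out) * relu_mask, h)
assert np.allclose(g_W_down, 0.0)

# ELSE (sketch): since the full Adapter stays largely inactive, keep only
# a single learnable vector on the residual stream (additive form assumed).
v = np.zeros(d)                     # the entire per-layer ELSE module
else_out = h + v
assert np.allclose(else_out, h)     # still identity at initialization
```

Under this reading, each adapted layer contributes only `d` trainable parameters, which is consistent with the abstract's claim of fine-tuning under 0.07% of the model's parameters.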
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 7285