Effective and Efficient Few-shot Fine-tuning for Vision Transformers

Published: 01 Jan 2024, Last Modified: 15 Sept 2025, ICME 2024, CC BY-SA 4.0
Abstract: Parameter-efficient fine-tuning (PEFT), which updates only a small set of parameters, either inherent to the model or additionally introduced, reduces the cost of adapting large vision models (e.g., Vision Transformers) and avoids overfitting to few-shot samples. However, the selection of parameters to update often follows heuristic criteria, lacks systematic analysis, and may lead to suboptimal results. In this work, we adopt the concept of skilled parameter localization (SPL) from the NLP community, which automatically identifies the location of task-specific parameters in a fine-tuned model for any given task. Applying this technique to ViTs, we observe that while the task-specific (skilled) parameters are scattered across the parameter space for different tasks, the out-projection biases of the attention and MLP layers are often concentrated among these skilled parameters. Inspired by this, we propose Out-projection Bias Fine-Tuning (OBFT), a simple yet effective PEFT method that performs few-shot adaptation by updating only the out-projection biases of the attention and MLP modules in pre-trained ViTs. We demonstrate the effectiveness and efficiency of OBFT on 10 diverse datasets: 1) OBFT achieves better parameter efficiency than a broad spectrum of PEFT strategies; 2) by updating only 0.01% of the ViT's parameters, OBFT attains performance comparable to full fine-tuning while significantly reducing training costs, as it does not need to maintain optimizer states for most parameters.
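To make the parameter selection described in the abstract concrete, the sketch below is a minimal PyTorch/timm illustration (not the authors' implementation) of freezing a pre-trained ViT and leaving only the out-projection biases of the attention and MLP blocks, plus the new task head, trainable. The parameter names `attn.proj.bias` and `mlp.fc2.bias` assume timm's standard ViT block layout.

```python
# Minimal sketch: train only the out-projection biases of attention and MLP
# blocks in a pre-trained ViT (assumed timm layout), plus the task head.
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze every parameter first.
for p in model.parameters():
    p.requires_grad = False

# Unfreeze only the out-projection biases (attn.proj.bias, mlp.fc2.bias)
# and the newly initialized classification head.
for name, p in model.named_parameters():
    if name.endswith("attn.proj.bias") or name.endswith("mlp.fc2.bias") or name.startswith("head."):
        p.requires_grad = True

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {n_trainable:,} of {n_total:,}")

# Only the unfrozen parameters are passed to the optimizer, so no optimizer
# state (e.g., AdamW moments) is kept for the frozen weights.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```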