TL;DR: This paper leverages the Canonical Polyadic Decomposition to define a parameter-efficient fine-tuning process for vision transformers.
Abstract: Modern methods for fine-tuning a Vision Transformer (ViT), such as Low-Rank Adaptation (LoRA) and its variants, demonstrate impressive performance. However, these methods ignore the high-dimensional nature of Multi-Head Attention (MHA) weight tensors. To address this limitation, we propose Canonical Rank Adaptation (CaRA). CaRA leverages tensor mathematics: first, it tensorises the transformer into two different tensors, one for the projection layers in MHA and the other for the feed-forward layers; second, it fine-tunes the tensorised formulation using a low-rank adaptation in Canonical Polyadic Decomposition (CPD) form. CaRA thereby keeps the number of trainable parameters small. Experimentally, CaRA outperforms existing Parameter-Efficient Fine-Tuning (PEFT) methods on visual classification benchmarks such as the Visual Task Adaptation Benchmark (VTAB)-1k and Fine-Grained Visual Categorization (FGVC).
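To make the idea concrete, below is a minimal sketch of a CPD-form low-rank update applied to a three-way weight tensor stacking the per-head MHA projections. The module name `CPDAdapter`, the `(num_heads, d_model, d_head)` tensorisation, and the initialization scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CPDAdapter(nn.Module):
    """Sketch of a CPD-form low-rank update for a 3-way weight tensor
    of shape (num_heads, d_model, d_head). All names and shapes here
    are illustrative assumptions, not CaRA's exact formulation."""

    def __init__(self, num_heads: int, d_model: int, d_head: int,
                 rank: int, scale: float = 1.0):
        super().__init__()
        # One trainable factor matrix per tensor mode, each with `rank` columns.
        self.A = nn.Parameter(torch.randn(num_heads, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(d_model, rank) * 0.02)
        # Zero-init one factor so the update is zero before training,
        # mirroring the usual LoRA initialization convention.
        self.C = nn.Parameter(torch.zeros(d_head, rank))
        self.scale = scale

    def delta(self) -> torch.Tensor:
        # CP reconstruction: delta[h, i, j] = sum_r A[h, r] * B[i, r] * C[j, r]
        return self.scale * torch.einsum('hr,ir,jr->hij', self.A, self.B, self.C)

# Usage: apply the trainable CPD update on top of frozen pretrained weights.
H, D, Dh, R = 12, 768, 64, 8
frozen = torch.randn(H, D, Dh)      # pretrained projections, kept frozen
adapter = CPDAdapter(H, D, Dh, R)
adapted = frozen + adapter.delta()  # only the three factor matrices train
```

Under this layout the adapter trains only R·(H + D + Dh) parameters (about 6.8k in the example above), versus H·D·Dh (about 590k) for an unconstrained update of the same tensor, which illustrates how the CPD form keeps the trainable-parameter count low.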
Lay Summary: Fine-tuning a Vision Transformer (ViT) by retraining the entire network is resource-intensive in both computation and memory. To address this limitation, a method called LoRA introduced a more efficient approach that trains only low-rank parts of the model. While resource-efficient, its design restricts the low-rank updates to two-dimensional matrices, so it does not fully capture the complex, multi-dimensional nature of ViT, which stems from the parallel attention computation blocks in its design. This work proposes Canonical Rank Adaptation (CaRA) to cater to the higher-dimensional nature of ViT. CaRA first tensorises the attention and feed-forward layers in ViT into two multi-dimensional tensors, then fine-tunes them using the Canonical Polyadic Decomposition (CPD), enabling low-rank updates across multiple dimensions. Experiments show that CaRA outperforms existing fine-tuning methods, including LoRA, on challenging benchmarks such as the Visual Task Adaptation Benchmark-1k and Fine-Grained Visual Categorization. Notably, it does so with significantly fewer trainable parameters, further helping to reduce the environmental cost of adapting vision models.
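For reference, the "low-rank updates across multiple dimensions" alluded to above rest on the standard rank-R CPD of a three-way tensor, written below in its general textbook form (this is the generic definition, not the paper's specific tensorisation):

```latex
% Rank-R Canonical Polyadic Decomposition of a three-way tensor
% \mathcal{T} \in \mathbb{R}^{I \times J \times K}: a sum of R rank-one
% terms, where \circ denotes the vector outer product.
\[
\mathcal{T} \;\approx\; \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad
\mathcal{T}_{ijk} \;\approx\; \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}.
\]
```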
Primary Area: Deep Learning->Other Representation Learning
Keywords: CaRA, Canonical Polyadic Decomposition, CPD, Tensor methods, ViT, LoRA
Submission Number: 324