TL;DR: This paper leverages the Canonical Polyadic Decomposition to define a parameter-efficient fine-tuning process for vision transformers.
Abstract: Modern methods for fine-tuning a Vision Transformer (ViT), such as Low-Rank Adaptation (LoRA) and its variants, demonstrate impressive performance. However, these methods ignore the high-dimensional nature of Multi-Head Attention (MHA) weight tensors. To address this limitation, we propose Canonical Rank Adaptation (CaRA). CaRA leverages tensor mathematics: first, it tensorises the transformer into two different tensors, one for the projection layers in MHA and the other for the feed-forward layers; second, it fine-tunes the tensorised formulation using a low-rank adaptation in Canonical Polyadic Decomposition (CPD) form. CaRA thereby keeps the number of trainable parameters small. Experimentally, CaRA outperforms existing Parameter-Efficient Fine-Tuning (PEFT) methods on visual classification benchmarks such as the Visual Task Adaptation Benchmark (VTAB)-1k and Fine-Grained Visual Categorization (FGVC).
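To make the idea concrete, below is a minimal sketch of a CPD-form low-rank update applied to a three-way weight tensor stacking the per-head MHA projections. The module name `CPDAdapter`, the `(num_heads, d_model, d_head)` tensorisation, and the initialization scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CPDAdapter(nn.Module):
    """Sketch of a CPD-form low-rank update for a 3-way weight tensor
    of shape (num_heads, d_model, d_head). All names and shapes here
    are illustrative assumptions, not CaRA's exact formulation."""

    def __init__(self, num_heads: int, d_model: int, d_head: int,
                 rank: int, scale: float = 1.0):
        super().__init__()
        # One trainable factor matrix per tensor mode, each with `rank` columns.
        self.A = nn.Parameter(torch.randn(num_heads, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(d_model, rank) * 0.02)
        # Zero-init one factor so the update is zero before training,
        # mirroring the usual LoRA initialization convention.
        self.C = nn.Parameter(torch.zeros(d_head, rank))
        self.scale = scale

    def delta(self) -> torch.Tensor:
        # CP reconstruction: delta[h, i, j] = sum_r A[h, r] * B[i, r] * C[j, r]
        return self.scale * torch.einsum('hr,ir,jr->hij', self.A, self.B, self.C)

# Usage: apply the trainable CPD update on top of frozen pretrained weights.
H, D, Dh, R = 12, 768, 64, 8
frozen = torch.randn(H, D, Dh)      # pretrained projections, kept frozen
adapter = CPDAdapter(H, D, Dh, R)
adapted = frozen + adapter.delta()  # only the three factor matrices train
```

Under this layout the adapter trains only R·(H + D + Dh) parameters (about 6.8k in the example above), versus H·D·Dh (about 590k) for an unconstrained update of the same tensor, which illustrates how the CPD form keeps the trainable-parameter count low.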
Lay Summary: Fine-tuning a Vision Transformer (ViT) by retraining the entire network is resource-intensive in both computation and memory. To address this limitation, a method called LoRA introduced a more efficient approach that trains only low-rank parts of the model. While resource-efficient, its design restricts the low-rank updates to two-dimensional matrices, so it does not fully capture the complex, multi-dimensional nature of ViT, which stems from the parallel attention computation blocks in its design. This work proposes Canonical Rank Adaptation (CaRA) to cater to the higher-dimensional nature of ViT. CaRA first tensorises the attention and feed-forward layers in ViT into two multi-dimensional tensors, then fine-tunes them using the Canonical Polyadic Decomposition (CPD), enabling low-rank updates across multiple dimensions. Experiments show that CaRA outperforms existing fine-tuning methods, including LoRA, on challenging benchmarks such as the Visual Task Adaptation Benchmark-1k and Fine-Grained Visual Categorization. Notably, it does so with significantly fewer trainable parameters, further helping to reduce the environmental cost of adapting vision models.
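For reference, the "low-rank updates across multiple dimensions" alluded to above rest on the standard rank-R CPD of a three-way tensor, written below in its general textbook form (this is the generic definition, not the paper's specific tensorisation):

```latex
% Rank-R Canonical Polyadic Decomposition of a three-way tensor
% \mathcal{T} \in \mathbb{R}^{I \times J \times K}: a sum of R rank-one
% terms, where \circ denotes the vector outer product.
\[
\mathcal{T} \;\approx\; \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,
\qquad
\mathcal{T}_{ijk} \;\approx\; \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}.
\]
```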
Primary Area: Deep Learning->Other Representation Learning
Keywords: CaRA, Canonical Polyadic Decomposition, CPD, Tensor methods, ViT, LoRA
Submission Number: 324