LRA-QViT: Integrating Low-Rank Approximation and Quantization for Robust and Efficient Vision Transformers

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Recently, transformer-based models have demonstrated state-of-the-art performance across various computer vision tasks, including image classification, detection, and segmentation. However, their substantial parameter count poses significant challenges for deployment in resource-constrained environments such as edge or mobile devices. Low-rank approximation (LRA) has emerged as a promising model compression technique, effectively reducing the number of parameters in transformer models by decomposing high-dimensional weight matrices into low-rank representations. Nevertheless, matrix decomposition inherently introduces information loss, often leading to a decline in model accuracy. Furthermore, existing studies on LRA largely overlook the quantization process, which is a critical step in deploying practical vision transformer (ViT) models. To address these challenges, we propose a robust LRA framework that preserves weight information after matrix decomposition and incorporates quantization tailored to LRA characteristics. First, we introduce a reparameterizable branch-based low-rank approximation (RB-LRA) method coupled with weight reconstruction to minimize information loss during matrix decomposition. Subsequently, we enhance model accuracy by integrating RB-LRA with knowledge distillation techniques. Lastly, we present an LRA-aware quantization method designed to mitigate the large outliers generated by LRA, thereby improving the robustness of the quantized model. To validate the effectiveness of our approach, we conducted extensive experiments on the ImageNet dataset using various ViT-based models. Notably, the Swin-B model with RB-LRA achieved a 31.8% reduction in parameters and a 30.4% reduction in GFLOPs, with only a 0.03% drop in accuracy. Furthermore, incorporating the proposed LRA-aware quantization method reduced accuracy loss by an additional 0.83% compared to naive quantization.
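As a concrete illustration of the decomposition step the abstract describes, the sketch below replaces a single linear layer with a truncated-SVD factorization. This is a minimal sketch in PyTorch under assumed names: plain truncated SVD is the generic baseline only, not the paper's RB-LRA, which additionally uses a reparameterizable branch, weight reconstruction, and distillation.

```python
import torch

def lra_linear(linear: torch.nn.Linear, rank: int) -> torch.nn.Sequential:
    """Replace a Linear layer with a rank-r factorization W ~= U_r diag(S_r) V_r^T.

    Generic truncated-SVD low-rank approximation for illustration only; the
    paper's RB-LRA builds on this idea but is not reproduced here.
    """
    W = linear.weight.data                       # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :rank], S[:rank], Vh[:rank, :]

    # Two thin layers: x -> Vh_r x -> (U_r diag(S_r)) (Vh_r x) + b
    down = torch.nn.Linear(linear.in_features, rank, bias=False)
    up = torch.nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    down.weight.data = Vh_r
    up.weight.data = U_r * S_r                   # scale each rank-1 component by its singular value
    if linear.bias is not None:
        up.bias.data = linear.bias.data.clone()
    return torch.nn.Sequential(down, up)
```

With rank r, the two factors hold r·(d_in + d_out) parameters instead of d_in·d_out, so the layer shrinks whenever r < d_in·d_out / (d_in + d_out); the gap between W and its rank-r approximation is the information loss that the paper's weight reconstruction and distillation steps aim to recover.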
Lay Summary: Recently, Vision Transformer (ViT) models have demonstrated state-of-the-art performance across a wide range of visual recognition tasks, including image classification and object detection. However, their substantial parameter counts and high computational complexity pose significant challenges for deployment in resource-constrained environments such as mobile and edge devices. To address these limitations, we propose a novel compression framework that synergistically combines low-rank approximation (LRA) and quantization. Specifically, we introduce a reparameterizable branch-based low-rank approximation (RB-LRA) method in conjunction with weight reconstruction (WR) initialization to mitigate the information loss incurred during matrix decomposition. Additionally, to reduce quantization errors caused by outliers emerging from the LRA process, we develop a weight-aware distribution scaling (WADS) method tailored to the structure of compressed models. The proposed framework significantly reduces model size and inference latency on real-world mobile and edge hardware while maintaining high predictive accuracy. Furthermore, it exhibits robust generalization performance across diverse modalities, including speech and language domains. These findings suggest that the proposed approach provides a practical and effective solution for compressing transformer-based models, enabling their efficient deployment in low-resource environments.
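For context on the quantization step mentioned above, the sketch below shows generic uniform symmetric quantization with per-tensor versus per-channel scales: a single large outlier inflates a shared scale and coarsens every other weight, which is the failure mode an outlier-aware scheme must address. This is an assumed PyTorch baseline for illustration, not the paper's WADS method.

```python
import torch

def quantize_symmetric(w: torch.Tensor, n_bits: int = 8, per_channel: bool = True) -> torch.Tensor:
    """Uniform symmetric fake-quantization of a 2-D weight matrix.

    Per-channel scales confine the damage of a few large outliers (such as
    those LRA can introduce) to their own rows; a per-tensor scale spreads it
    over the whole matrix. Generic baseline only -- not the paper's WADS.
    """
    qmax = 2 ** (n_bits - 1) - 1
    if per_channel:
        scale = w.abs().amax(dim=1, keepdim=True) / qmax   # one scale per output row
    else:
        scale = w.abs().max() / qmax                        # one scale for the whole tensor
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_q * scale                                      # dequantized weights for error inspection
```

Comparing `(w - quantize_symmetric(w, per_channel=False)).abs().max()` against the per-channel variant on a factorized weight makes the outlier effect visible, which is the motivation for tailoring the quantizer to the LRA structure.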
Primary Area: Applications->Computer Vision
Keywords: Vision Transformer, Compression, Low-Rank Approximation, Quantization
Submission Number: 5998