Attention Weight Decomposition for Vision Model Compression
Keywords: Weight Pruning, Weight Decomposition, Low-rank Approximation
TL;DR: We propose UniCom, which decomposes only one side of each attention projection pair ($Q$–$K$, $V$–$O$), chosen adaptively by rank sensitivity, improving accuracy at high compression rates over prior decomposition methods.
Abstract: Weight decomposition is a practical compression approach, yet prior methods overlook that multi-head attention consists of paired linear projections ($Q$–$K$ and $V$–$O$). We propose \textit{Uni}lateral decomposition for attention \textit{Com}pression (\textit{UniCom}), which decomposes only one side of each pair to better manage approximation error while leveraging attention’s linear operations.
Because the $Q$, $K$, $V$, and $O$ weights differ in low-rank sensitivity across heads, UniCom adaptively chooses which side of each pair to decompose based on rank sensitivity. Experiments show that it generalizes across vision and vision–language tasks. At a 60\% MHA reduction, UniCom improves Top-1 accuracy by +11.2\% / +5.7\% / +6.4\% over combined decomposition on DeiT-S / DeiT-S (distill) / DeiT-B, respectively.
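To make the idea of one-sided decomposition concrete, the following is a minimal sketch for a single head's $Q$–$K$ pair. It is not the authors' implementation: the function names (`truncated_factors`, `low_rank`, `choose_side`), the use of truncated SVD, and the selection criterion (Frobenius error of the product $W_Q W_K^\top$) are all assumptions chosen for illustration, since the abstract does not define how rank sensitivity is measured.

```python
import numpy as np


def truncated_factors(w, rank):
    """Return factors (a, b) with a @ b equal to the best rank-`rank` approximation of w."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    # Fold the singular values into the left factor.
    return u[:, :rank] * s[:rank], vt[:rank, :]


def low_rank(w, rank):
    """Rank-`rank` approximation of w via truncated SVD."""
    a, b = truncated_factors(w, rank)
    return a @ b


def choose_side(w_q, w_k, rank):
    """Pick which side of a Q-K pair to decompose.

    Assumed criterion (hypothetical, not from the paper): decompose the side
    whose low-rank approximation perturbs the product w_q @ w_k.T the least.
    """
    target = w_q @ w_k.T
    err_q = np.linalg.norm(low_rank(w_q, rank) @ w_k.T - target)
    err_k = np.linalg.norm(w_q @ low_rank(w_k, rank).T - target)
    return ("Q", err_q) if err_q <= err_k else ("K", err_k)


# Toy example: w_q is exactly rank 4, w_k is a full-rank random matrix.
rng = np.random.default_rng(0)
w_q = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 16))
w_k = rng.standard_normal((64, 16))

side, err = choose_side(w_q, w_k, rank=4)
print(side, err)
```

In this toy setup $W_Q$ has rank 4 by construction, so decomposing it at rank 4 is lossless up to floating-point error and the sketch selects the $Q$ side. The same pattern would apply to a $V$–$O$ pair under the analogous assumption that the product of the two projections is what should be preserved.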
Submission Number: 31