Keywords: Model Compression, Pruning, Contrastive Language-Image Pre-training
Abstract: Contrastive Language-Image Pre-training (CLIP) has been widely applied to various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation costs, which prohibits its use in resource-limited application scenarios. Existing CLIP compression methods typically shrink the pre-trained CLIP weights by selecting a subset of them as the weight inheritance for further retraining, via mask optimization or weight-importance measurement. However, such select-based weight inheritance often compromises the feature representation ability, especially under extreme compression. In this paper, we propose a novel $\textit{mapping-based}$ CLIP compression framework, $\textbf{\textit{CLIP-Map}}$. It leverages learnable matrices to map and combine pretrained weights via $\textit{Full-Mapping with Kronecker Factorization}$, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose $\textit{Diagonal Inheritance Initialization}$, which reduces distribution shift and enables efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains under high compression settings.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 7305