Abstract: Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models like CLIP, which have been widely used for zero-shot classification and image-text retrieval.
In this paper, we propose TernaryCLIP, a lightweight computational framework that converts both the vision and text encoders of CLIP into ternary-weight formats.
TernaryCLIP incorporates quantization-aware training and ternarization-aware distillation from a full-precision CLIP teacher, enabling low-cost, high-efficiency computation.
Comprehensive experiments across 41 real-world datasets demonstrate that TernaryCLIP achieves up to a 16× storage reduction, 60% weight sparsity, and 2.3× inference acceleration while maintaining competitive accuracy on zero-shot image classification and image-text retrieval tasks.
Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices.
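To make the idea of ternary-weight conversion concrete, the sketch below shows one common way such quantization can be implemented: TWN-style thresholding that maps each weight to {-α, 0, +α}, combined with a straight-through estimator for quantization-aware training. This is a minimal illustration under assumed design choices (per-tensor threshold, identity gradient); the actual thresholds, granularity, and distillation losses used in TernaryCLIP are not specified in the abstract.

```python
import torch

def ternarize(w: torch.Tensor, delta_scale: float = 0.7):
    """TWN-style ternarization: map each weight to {-alpha, 0, +alpha} (illustrative, not TernaryCLIP's exact scheme)."""
    # Per-tensor threshold proportional to the mean absolute weight.
    delta = delta_scale * w.abs().mean()
    mask = (w.abs() > delta).float()          # 1 where the weight survives, 0 where it is zeroed out
    codes = torch.sign(w) * mask              # ternary codes in {-1, 0, +1}
    # Scaling factor alpha: mean magnitude of the surviving weights.
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)
    return alpha * codes, mask.mean()         # quantized weights and density (1 - sparsity)

class TernarySTE(torch.autograd.Function):
    """Straight-through estimator: forward uses ternary weights, backward updates the latent full-precision weights."""
    @staticmethod
    def forward(ctx, w):
        w_t, _ = ternarize(w)
        return w_t

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # pass gradients through unchanged to the full-precision weights

if __name__ == "__main__":
    w = torch.randn(512, 512)
    w_t, density = ternarize(w)
    print(f"distinct values: {w_t.unique().numel()}, sparsity: {1 - density:.2f}")
```

In this setup, only the ternary codes and one scale per tensor need to be stored, which is where the large storage reduction and weight sparsity reported above come from.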
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal pretraining, image text matching, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5555