TernaryCLIP: Efficient Multimodal Distillation with Ternary Quantization

ACL ARR 2025 May Submission5555 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent years have witnessed increasing interest in image-text contrastive modeling, exemplified by models such as CLIP, which are widely used for zero-shot classification and image-text retrieval. In this paper, we propose TernaryCLIP, a lightweight computational framework that converts both the vision and text encoders of CLIP into ternary-weight formats. TernaryCLIP incorporates quantization-aware training and ternarization-aware distillation from a full-precision CLIP teacher, enabling low-cost, high-efficiency computation. Comprehensive experiments across 41 real-world datasets demonstrate that TernaryCLIP achieves up to a 16× storage reduction, 60% weight sparsity, and 2.3× inference acceleration while maintaining competitive accuracy on zero-shot image classification and image-text retrieval tasks. Our work highlights the feasibility of extreme quantization for large multimodal models, supporting effective and efficient deployment on resource-constrained devices.
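For intuition about the ternary-weight format mentioned in the abstract, the sketch below shows a common threshold-based ternarization rule in the style of Ternary Weight Networks, mapping full-precision weights to {-α, 0, +α}. The 0.7·mean(|w|) threshold heuristic and per-tensor scaling factor are illustrative assumptions, not necessarily the exact scheme used by TernaryCLIP.

```python
import torch

def ternarize(w: torch.Tensor, delta_scale: float = 0.7):
    """Threshold-based ternarization sketch (TWN-style); assumed, not the paper's exact rule.

    Maps a full-precision weight tensor to values in {-alpha, 0, +alpha}.
    """
    delta = delta_scale * w.abs().mean()                         # per-tensor threshold (assumed heuristic)
    mask = (w.abs() > delta).float()                             # 1 where the weight is kept, 0 where it is zeroed
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)   # scale from the surviving weights
    w_ternary = alpha * torch.sign(w) * mask                     # ternary-valued tensor
    sparsity = 1.0 - mask.mean().item()                          # fraction of zeroed weights
    return w_ternary, sparsity
```

In a quantization-aware training setup, such a ternarizer is typically applied in the forward pass while gradients flow to the latent full-precision weights (e.g., via a straight-through estimator); the resulting sparsity is what enables the storage and inference savings reported above.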
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, cross-modal pretraining, image text matching, cross-modal application
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 5555