Abstract: Large-scale pre-trained visual-text embedding networks such as CLIP have achieved significant progress in cross-modal information retrieval. However, adapting these models to specific scenarios often incurs high training costs and substantial memory demands. To address this, we introduce TinyCMR, a compact framework for cross-modal image-text retrieval. TinyCMR refines the shared semantic space produced by pre-trained models through a combination of reconstruction, modality discrimination, and contrastive learning tasks. By freezing the pre-trained model and adding only a few linear layers, TinyCMR considerably reduces both training time and space complexity. Experimental results on the MS-COCO and Flickr30K datasets show that TinyCMR achieves accuracy comparable to fine-tuned CLIP and other large-scale models, with over a 400-fold reduction in trainable parameters and a 1000-fold decrease in training time under the same GPU memory constraint. Additionally, TinyCMR proves to be an efficient tool for enhancing the performance of already fine-tuned pre-trained models, delivering significant improvements with minimal additional training. This positions TinyCMR as a versatile solution both for adapting zero-shot models and for refining fine-tuned models in cross-modal retrieval tasks.
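To illustrate the general shape of the approach described above, the sketch below shows a frozen pre-trained encoder paired with a small trainable head combining contrastive, reconstruction, and modality-discrimination objectives. This is a minimal assumption-laden sketch, not the authors' implementation: the actual TinyCMR layer sizes, loss formulations, loss weights, and how the modality discriminator is used (e.g., adversarially or as an auxiliary task) are not specified in the abstract, and all names here (TinyCMRHead, the 0.07 temperature, equal loss weighting) are hypothetical.

```python
# Hedged sketch of a TinyCMR-style trainable head over frozen CLIP embeddings.
# Assumptions: PyTorch, 512-d embeddings, simple classifier-style modality loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCMRHead(nn.Module):
    """A few linear layers trained on top of frozen image/text embeddings."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.image_proj = nn.Linear(dim, dim)   # refines frozen image embeddings
        self.text_proj = nn.Linear(dim, dim)    # refines frozen text embeddings
        self.reconstruct = nn.Linear(dim, dim)  # maps refined space back to original
        self.modality_clf = nn.Linear(dim, 2)   # predicts image (0) vs. text (1)

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: embeddings from the frozen pre-trained model, (batch, dim).
        zi = F.normalize(self.image_proj(img_emb), dim=-1)
        zt = F.normalize(self.text_proj(txt_emb), dim=-1)

        # Contrastive (InfoNCE-style) loss over the refined shared space.
        logits = zi @ zt.t() / 0.07
        labels = torch.arange(zi.size(0), device=zi.device)
        loss_con = (F.cross_entropy(logits, labels)
                    + F.cross_entropy(logits.t(), labels)) / 2

        # Reconstruction loss: refined embeddings should still carry the
        # information of the original frozen embeddings.
        loss_rec = (F.mse_loss(self.reconstruct(zi), img_emb)
                    + F.mse_loss(self.reconstruct(zt), txt_emb)) / 2

        # Modality discrimination loss on the refined embeddings.
        z = torch.cat([zi, zt], dim=0)
        m = torch.cat([torch.zeros(zi.size(0), dtype=torch.long),
                       torch.ones(zt.size(0), dtype=torch.long)]).to(z.device)
        loss_mod = F.cross_entropy(self.modality_clf(z), m)

        # Equal weighting is an assumption; the paper may weight the terms differently.
        return loss_con + loss_rec + loss_mod
```

In use, only the head's parameters would be passed to the optimizer while the pre-trained encoders stay frozen, which is what keeps the trainable parameter count and training time small.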