WaveDN: A Wavelet-based Training-free Zero-shot Enhancement for Vision-Language Models

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Vision-Language Models (VLMs) built on contrastive learning, such as CLIP, exhibit strong transferability and excel in downstream tasks such as zero-shot classification and retrieval. To further enhance VLM performance, existing methods introduce additional parameter modules or fine-tune VLMs on downstream datasets. However, these methods fall short when labeled data for the downstream task is unavailable or insufficient for fine-tuning, and training additional parameter modules may considerably impair the transferability VLMs already possess. To alleviate these issues, we introduce WaveDN, a wavelet-based distribution normalization method that boosts VLM performance on downstream tasks without parametric modules or labeled data. First, wavelet distributions are extracted from the embeddings of sampled, unlabeled test data. Next, WaveDN performs a hierarchical normalization across the wavelet coefficients of all embeddings, thereby incorporating the distributional characteristics of the test data. Finally, the normalized embeddings are reconstructed via the inverse wavelet transform, enabling the computation of similarity metrics between samples. In extensive experiments on two downstream tasks, spanning 14 datasets that cover text-image and text-audio modalities, WaveDN outperforms state-of-the-art methods.
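To make the decompose-normalize-reconstruct pipeline concrete, below is a minimal sketch of how such a normalization could look, assuming Haar wavelets via PyWavelets and per-sub-band z-score normalization with batch statistics. The function name `wavedn_normalize`, the choice of wavelet, and the exact normalization rule are illustrative assumptions, not the paper's precise formulation.

```python
# Illustrative sketch of wavelet-based distribution normalization (not the
# authors' exact method): decompose embeddings into wavelet sub-bands,
# normalize each sub-band with statistics of the unlabeled test batch, and
# reconstruct via the inverse transform.
import numpy as np
import pywt


def wavedn_normalize(embeddings: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Normalize each wavelet sub-band using unlabeled test-batch statistics.

    embeddings: (n_samples, dim) array of VLM embeddings; dim a power of two
    keeps the Haar reconstruction exact.
    """
    # 1) Decompose every embedding into multi-level wavelet coefficients.
    coeffs = pywt.wavedec(embeddings, wavelet, axis=-1)

    # 2) Hierarchical normalization: z-normalize each level (assumed rule)
    #    with mean/std estimated across the sampled test embeddings.
    normalized = []
    for c in coeffs:
        mu = c.mean(axis=0, keepdims=True)
        sigma = c.std(axis=0, keepdims=True) + 1e-8
        normalized.append((c - mu) / sigma)

    # 3) Reconstruct the embeddings via the inverse wavelet transform.
    recon = pywt.waverec(normalized, wavelet, axis=-1)

    # 4) L2-normalize so cosine similarity reduces to a dot product.
    return recon / np.linalg.norm(recon, axis=-1, keepdims=True)


# Usage: normalize image and text embeddings with test-set statistics, then
# score zero-shot classification by cosine similarity (random data as stand-in).
img = wavedn_normalize(np.random.randn(256, 512))  # e.g. CLIP image embeddings
txt = wavedn_normalize(np.random.randn(10, 512))   # class-prompt text embeddings
similarity = img @ txt.T                           # (256, 10) zero-shot logits
```

Because the normalization uses only statistics of unlabeled test embeddings, this sketch stays consistent with the training-free, label-free setting the abstract describes.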
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work introduces WaveDN, a wavelet-based distribution normalization method for Vision-Language Models (VLMs), contributing directly to multimedia/multimodal processing. VLMs play an irreplaceable role in multimedia and cross-modal tasks by aligning feature representations between text and image modalities through contrastive learning. WaveDN improves VLM performance on downstream tasks without labeled data or additional parameter modules, which matters for practical deployment and offers a useful reference for future research. Extensive evaluations on two downstream tasks, spanning 14 datasets that cover text-image and text-audio modalities, demonstrate WaveDN's superiority over state-of-the-art methods and broaden the applicability of VLMs in multimedia and multimodal processing.
Supplementary Material: zip
Submission Number: 4671