Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

TMLR Paper5904 Authors

15 Sept 2025 (modified: 01 Oct 2025) | Under review for TMLR | Everyone | CC BY 4.0
Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, which makes text-visual alignment less effective. In this paper, we tackle this issue from two perspectives, view refinement and description refinement, which together we term Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches that have high Intersection over Union (IoU) ratios, yielding more distinctive visual samples. Description refinement removes redundant text descriptions that have high pairwise cosine similarity, ensuring greater diversity among the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, supporting the need to remove redundant information in visual-text alignment. Our code is available at: https://anonymous.4open.science/r/BiFTA-TMLR-Re-submission.
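To make the two refinement steps in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes image patches are axis-aligned boxes `(x1, y1, x2, y2)` and descriptions are already embedded as vectors. The greedy keep-first order, the function names, and the thresholds `0.5` and `0.9` are all illustrative assumptions; the abstract only states that items with high IoU or high pairwise cosine similarity are removed.

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def greedy_filter(items, score, threshold):
    """Return indices of items whose score against every kept item is <= threshold.

    Assumption: items are visited in input order and the first of any
    redundant pair is kept. The paper may use a different selection rule.
    """
    kept = []
    for i, item in enumerate(items):
        if all(score(item, items[j]) <= threshold for j in kept):
            kept.append(i)
    return kept

def refine_views(boxes, iou_threshold=0.5):
    """View refinement sketch: drop patches that overlap a kept patch too much."""
    return greedy_filter(boxes, iou, iou_threshold)

def refine_descriptions(embeddings, sim_threshold=0.9):
    """Description refinement sketch: drop descriptions too similar to a kept one."""
    return greedy_filter(embeddings, cosine, sim_threshold)

if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
    print(refine_views(boxes))
    print(refine_descriptions(embeddings))
```

In this toy input the second box overlaps the first heavily and the second embedding is nearly parallel to the first, so both are treated as redundant under the assumed thresholds, while the third item in each list is kept.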
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=qG8vstoyyr&referrer=%5Bthe%20profile%20of%20Yuhao%20Sun%5D(%2Fprofile%3Fid%3D~Yuhao_Sun8)
Changes Since Last Submission:

**Hyperparameter Sensitivity Analysis** We carefully addressed the Action Editor’s concern regarding the lack of in-depth hyperparameter sensitivity analysis. Specifically, we conducted extensive ablation studies on all hyperparameters to examine the robustness of our proposed framework. The results are presented in Section 5 and Appendices 3 and 5. In addition, we revised the writing to make our contributions clearer and introduced a new motivation figure (Fig. 2) to visually support our framework design.

**Expanded Experiments and Alternative Methods** In response to reviewers’ concerns about experimental performance and completeness, we conducted additional experiments on different VLMs. Moreover, as suggested, we explored alternative refinement strategies and provided a detailed comparison and analysis against our proposed Bi-refinement method. This strengthens both the empirical validity and the comprehensiveness of our study.

**Revised Paper Structure and Clarity** We reorganized the structure of the paper so that the main content now includes all key experimental results. The Introduction, Method, and Experiment sections were carefully rewritten to clearly present the motivation, contributions, and underlying principles of our proposed method. We believe the revised version provides a more coherent and accessible presentation.

**Generalization of BiFTA beyond WCA-style framework** Based on the Action Editor’s concern regarding the generalization of our framework to other WCA-style approaches, we conducted a comprehensive literature review on subsequent developments of WCA. Our review indicates that no further advancements in WCA frameworks have been proposed. Nevertheless, prior score-based zero-shot classification methods, such as CuPL and CLIP-D, provide evidence of the generalization capability of our proposed BiFTA framework.
Assigned Action Editor: ~Jose_Dolz1
Submission Number: 5904