Double-Filter: Efficient Fine-tuning of Pre-trained Vision-Language Models via Patch&Layer Filtering

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Efficient Fine-tuning of Pre-trained Vision-Language Models via Patch and Layer Filtering
Abstract: In this paper, we present a novel approach, termed Double-Filter, to “slim down” the fine-tuning of vision-language pre-trained (VLP) models by filtering redundancy in both the feature inputs and the architectural components. We enhance fine-tuning in two ways. First, we develop a new patch selection method that filters image patches via background/foreground separation, followed by a refined patch selection step. Second, we design a genetic algorithm that eliminates redundant fine-grained architecture layers, improving both the efficiency and the effectiveness of the model. The former makes the semantics of the selected patches more comprehensive, improving inference efficiency while preserving semantic representation; the latter's fine-grained layer filter removes as much architectural redundancy as possible while mitigating the impact on performance. Experimental results demonstrate that the proposed Double-Filter achieves superior fine-tuning efficiency while maintaining performance competitive with advanced efficient fine-tuning methods on three downstream tasks: VQA, NLVR, and retrieval. In addition, it is shown to be effective on both the METER and ViLT VLP models.
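To make the first filter concrete, below is a minimal, self-contained sketch of foreground/background-aware patch filtering in the spirit of the abstract. It is not the authors' implementation: the saliency input, the mean-threshold foreground/background split, and the keep ratios (`fg_keep`, `bg_keep`) are illustrative assumptions.

```python
# Minimal sketch of foreground/background-aware patch filtering.
# NOT the paper's code: the scoring source, the fg/bg split rule,
# and the keep ratios are illustrative assumptions.
import torch

def filter_patches(patch_tokens, patch_scores, fg_keep=0.6, bg_keep=0.2):
    """patch_tokens: (N, D) patch embeddings; patch_scores: (N,) saliency.

    Split patches into foreground/background by the mean score, then keep
    the highest-scoring fraction of each group, so background context is
    down-weighted rather than discarded outright.
    """
    fg_mask = patch_scores >= patch_scores.mean()   # crude fg/bg separation
    kept = []
    for mask, ratio in ((fg_mask, fg_keep), (~fg_mask, bg_keep)):
        idx = mask.nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        k = max(1, int(ratio * idx.numel()))
        top = patch_scores[idx].topk(k).indices
        kept.append(idx[top])
    kept = torch.cat(kept).sort().values            # restore spatial order
    return patch_tokens[kept], kept

# Example: 196 patches (14x14 grid) with 768-dim embeddings.
tokens = torch.randn(196, 768)
scores = torch.rand(196)
slim_tokens, kept_idx = filter_patches(tokens, scores)
```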
Lay Summary: Modern AI models pre-trained on both images and text (vision-language models) are powerful but often inefficient to adapt for specific tasks. Our new method, Double-Filter, streamlines this process by cutting unnecessary computation in two ways:

Adaptive Patch Selection: Unlike methods that focus only on foreground objects, our approach dynamically evaluates both foreground and background regions, retaining only the most semantically meaningful patches. This ensures comprehensive image understanding while avoiding redundant processing of unimportant details.

Optimized Architecture: Using an automated search inspired by evolution (a genetic algorithm), we identify and remove redundant layers in the model, making it leaner while preserving accuracy.

Tests on three key tasks, visual question answering (VQA), visual reasoning (NLVR), and image-text retrieval, show that Double-Filter makes fine-tuning faster and more efficient, with little to no drop in performance. It works across different popular models, offering a practical way to reduce costs and energy use in AI applications.

Why it matters: By intelligently filtering data and architecture, our method helps deploy adaptable AI systems more sustainably, balancing speed, cost, and accuracy.
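For the second filter, here is a toy illustration of a genetic algorithm searching over a binary keep/drop mask for layers. Again a sketch under stated assumptions, not the paper's procedure: the `layer_utility` stand-in, the fitness penalty weight, and the population settings are all hypothetical, where the real method would briefly evaluate each candidate sub-network on a validation task.

```python
# Toy genetic algorithm over a binary layer-keep mask.
# Fitness is a stand-in: Double-Filter would score masks by actually
# evaluating the pruned model; all constants here are illustrative.
import random

NUM_LAYERS, POP_SIZE, GENERATIONS = 12, 20, 30
MUT_RATE = 0.1

# Hypothetical fixed per-layer utility (placeholder for validation accuracy).
layer_utility = [random.random() for _ in range(NUM_LAYERS)]

def fitness(mask):
    # Reward retained utility, penalize the fraction of layers kept.
    utility = sum(u for u, m in zip(layer_utility, mask) if m)
    cost = sum(mask) / NUM_LAYERS
    return utility - 2.0 * cost

def crossover(a, b):
    cut = random.randrange(1, NUM_LAYERS)           # single-point crossover
    return a[:cut] + b[cut:]

def mutate(mask):
    return [1 - m if random.random() < MUT_RATE else m for m in mask]

population = [[random.randint(0, 1) for _ in range(NUM_LAYERS)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP_SIZE // 2]            # truncation selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print("layer keep-mask:", best)
```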
Primary Area: Deep Learning->Other Representation Learning
Keywords: Efficient Pre-trained Vision-Language Model, Patch Redundancy Filter, Architecture Redundancy Filter
Submission Number: 6164