Combining Vision-Language Models and Weak Supervision for Nuanced Vision Classification Tasks

Published: 01 Jan 2025 · Last Modified: 19 Sept 2025 · CVPR Workshops 2025 · CC BY-SA 4.0
Abstract: Nuanced-concept image classification tasks often require substantial labeled data, and labeling such data is time-consuming and labor-intensive. While zero-shot methods like CLIP, Modeling Collaborator, and AdaptCLIPZS have shown promising results, they generally lack a versatile open-source pipeline for domain-independent, multi-class fine-grained classification. We propose a classification pipeline that combines weak supervision with open-source Vision-Language Models (VLMs) and can be employed in both binary and multi-class nuanced classification problems. The pipeline is domain-independent because it relies on knowledge embedded in the pre-training of VLMs, eliminating the need for context-specific fine-tuning as required by methods such as AdaptCLIPZS. In our pipeline, VLMs serve as weak labelers, while a Weak Supervision (WS) model aggregates their labels and produces a set of pseudo labels (pseudo ground truth) used to train an end classifier. We conducted multiple experiments to validate the pipeline on both binary and multi-class classification tasks. The results show that our pipeline outperforms state-of-the-art zero-shot classification methods on both binary and multi-class problems.
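To make the VLMs-as-weak-labelers idea concrete, the following is a minimal sketch of the aggregation step. It is not the paper's actual WS model (which is likely a learned label model rather than a simple vote); all names and data here are hypothetical, and majority voting with abstains stands in for whatever aggregator the pipeline uses.

```python
from collections import Counter

# Hypothetical weak labels: three VLM prompts each vote on five images.
# -1 denotes an abstain (the VLM produced no confident answer).
vlm_votes = [
    [0, 0, 1],    # image 0
    [1, 1, 1],    # image 1
    [2, -1, 2],   # image 2
    [0, 1, 0],    # image 3
    [-1, -1, 1],  # image 4
]

def aggregate(votes):
    """Majority-vote aggregation of weak labels, ignoring abstains (-1).

    Returns one pseudo label per image, or -1 if every labeler abstained.
    """
    pseudo = []
    for row in votes:
        counts = Counter(v for v in row if v != -1)
        pseudo.append(counts.most_common(1)[0][0] if counts else -1)
    return pseudo

pseudo_labels = aggregate(vlm_votes)
print(pseudo_labels)  # images with pseudo label != -1 would train the end classifier
```

In the full pipeline, these pseudo labels (the "pseudo ground truth") would replace human annotations when training the downstream classifier, which is what removes the manual labeling bottleneck the abstract describes.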