Keywords: Fine-grained visual recognition, Multimodal Large Language Models
TL;DR: Leveraging MLLM-generated weak labels to fine-tune a CLIP model for efficient zero-shot fine-grained visual recognition.
Abstract: Fine-grained Visual Recognition (FGVR) involves differentiating between visually similar categories and is challenging due to the subtle differences between them and the need for large, expert-annotated datasets. We observe that recent Multimodal Large Language Models (MLLMs) demonstrate potential in FGVR, but querying such models for every test input is impractical due to high costs and time inefficiencies. To address this, we propose a novel pipeline that fine-tunes a CLIP model for FGVR by leveraging MLLMs. Our approach requires only a small support set of unlabeled images to construct a weakly supervised dataset, with MLLMs serving as label generators. To mitigate the impact of these noisy labels, we construct a candidate set for each image using the labels of its neighboring images, thereby increasing the likelihood that the correct label is included. We then employ a partial label learning algorithm to fine-tune a CLIP model on these candidate sets. Our method sets a new benchmark for efficient fine-grained classification, achieving performance comparable to MLLMs at just $1/100^{th}$ of the inference cost and a fraction of the time.
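The abstract's two core steps, pooling neighbors' MLLM-generated labels into per-image candidate sets and optimizing a partial-label objective, can be sketched as follows. This is a minimal illustration under assumptions: the neighbor count `k`, the use of cosine similarity over CLIP embeddings, and the classifier-consistent PLL loss shown here are placeholders, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def build_candidate_sets(image_feats, mllm_labels, num_classes, k=5):
    """For each image, pool the MLLM-generated labels of its k nearest
    neighbors (cosine similarity) into a binary candidate mask."""
    feats = F.normalize(image_feats, dim=-1)
    sim = feats @ feats.t()                      # (N, N) cosine similarities
    knn = sim.topk(k + 1, dim=-1).indices        # neighbors, including self
    candidates = torch.zeros(len(mllm_labels), num_classes)
    for i, nbrs in enumerate(knn):
        candidates[i, mllm_labels[nbrs]] = 1.0   # union of neighbor labels
    return candidates

def partial_label_loss(logits, candidates):
    """One standard partial-label objective: maximize the total probability
    mass assigned to the candidate set (a baseline, not necessarily the
    exact algorithm used in the paper)."""
    probs = logits.softmax(dim=-1)
    cand_prob = (probs * candidates).sum(dim=-1).clamp_min(1e-8)
    return -cand_prob.log().mean()

# Toy usage: 8 support images, 10 fine-grained classes.
feats = torch.randn(8, 512)                      # CLIP image embeddings
weak_labels = torch.randint(0, 10, (8,))         # noisy MLLM-generated labels
cands = build_candidate_sets(feats, weak_labels, num_classes=10)
logits = torch.randn(8, 10, requires_grad=True)  # image-text similarity logits
loss = partial_label_loss(logits, cands)
loss.backward()
```

Taking the union over neighbors trades precision for recall: the candidate set grows with `k`, but the chance that the true label is covered, which the partial-label loss relies on, increases with it.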
Submission Number: 157