Abstract: Zero-shot fine-grained image classification poses significant challenges for vision-language models (VLMs), primarily due to the subtle distinctions among closely related classes. This paper introduces CascadeVLM, a cascading framework that integrates CLIP with large vision-language models (LVLMs), harnessing the complementary strengths of both for fine-grained image classification.
Our methodology involves two steps. First, CLIP is employed to identify a set of candidate classes based on its prediction confidence. Then, an LVLM performs zero- or few-shot prediction restricted to these candidate classes.
Empirical evaluations on four fine-grained image classification benchmarks demonstrate CascadeVLM's superior performance compared to individual models. For example, on the StanfordCars dataset, CascadeVLM achieves an impressive 85.6% zero-shot accuracy. Further efficiency analysis uncovers a trade-off between inference speed and prediction accuracy, and error analysis indicates that failed samples primarily stem from LVLMs' prediction errors, even when provided with the correct candidate class options.
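The two-step cascade described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `clip_scores` and `lvlm_choose` are hypothetical stand-ins for a real CLIP scorer and an LVLM prompt, and the confidence-based early exit (skipping the slower LVLM when CLIP is already sure) is an assumption motivated by the speed/accuracy trade-off the paper reports.

```python
def cascade_classify(image, class_names, clip_scores, lvlm_choose,
                     k=5, threshold=0.9):
    """Two-stage cascade sketch: CLIP narrows candidates, an LVLM decides.

    clip_scores(image, class_names) -> list of per-class confidences
    lvlm_choose(image, candidates)  -> one class name from candidates
    Both are hypothetical stand-in callables, not a real API.
    """
    scores = clip_scores(image, class_names)
    ranked = sorted(zip(class_names, scores), key=lambda p: p[1], reverse=True)
    top_name, top_score = ranked[0]
    # Assumed early exit: if CLIP is confident enough, skip the LVLM call.
    if top_score >= threshold:
        return top_name
    # Otherwise, hand the top-k candidate classes to the LVLM.
    candidates = [name for name, _ in ranked[:k]]
    return lvlm_choose(image, candidates)
```

In practice, `k` and `threshold` would be tuned per dataset, since they control how often the expensive LVLM is invoked.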
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English