Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

Anonymous

16 Dec 2023, ACL ARR 2023 December Blind Submission, Readers: Everyone
Abstract: Zero-shot fine-grained image classification is challenging for vision language models (VLMs), primarily because closely related classes differ only in subtle ways. This paper introduces CascadeVLM, a cascading framework that integrates CLIP with large vision language models (LVLMs) to combine the strengths of both in fine-grained image classification. The method has two steps. First, CLIP identifies candidate classes based on its prediction confidence. Then, LVLMs make a zero-shot or few-shot prediction restricted to these candidate classes. Empirical evaluations on four fine-grained image classification benchmarks show that CascadeVLM outperforms the individual models. For example, on the StanfordCars dataset, CascadeVLM achieves 85.6% zero-shot accuracy. Further efficiency analysis uncovers a trade-off between inference speed and prediction accuracy, and error analysis indicates that failures stem primarily from LVLM prediction errors, even when the correct class is among the candidates provided.
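The two-step cascade described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate count `k`, the `lvlm_choose` callback standing in for a real LVLM call, and the fallback to CLIP's top-1 prediction are all assumptions made for illustration. The CLIP scores are assumed to be per-class image-text similarity logits computed elsewhere.

```python
import math
from typing import Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw similarity scores into a confidence distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_candidates(
    class_names: Sequence[str], clip_scores: Sequence[float], k: int
) -> List[str]:
    """Step 1 (assumed form): keep the k classes CLIP is most confident in."""
    probs = softmax(clip_scores)
    ranked = sorted(zip(class_names, probs), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def cascade_predict(
    class_names: Sequence[str],
    clip_scores: Sequence[float],
    lvlm_choose: Callable[[List[str]], str],  # hypothetical LVLM wrapper
    k: int = 3,  # assumed candidate count, not taken from the paper
) -> str:
    """Step 2 (assumed form): let the LVLM pick among the CLIP candidates."""
    candidates = top_k_candidates(class_names, clip_scores, k)
    choice = lvlm_choose(candidates)
    # Assumption: fall back to CLIP's top-1 if the LVLM answers off-list.
    return choice if choice in candidates else candidates[0]


# Usage with made-up scores and stub LVLMs in place of real models.
classes = ["sedan", "coupe", "hatchback", "pickup"]
scores = [2.0, 2.5, 0.1, -1.0]
picked = cascade_predict(classes, scores, lambda c: c[-1], k=2)
fallback = cascade_predict(classes, scores, lambda c: "truck", k=2)
```

In this sketch a larger `k` gives the LVLM more chances to recover from a CLIP ranking error but makes its prompt longer, which is one plausible source of the speed/accuracy trade-off the abstract mentions.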
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
