Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

Anonymous

16 Dec 2023 | ACL ARR 2023 December Blind Submission | Readers: Everyone

Abstract: Zero-shot fine-grained image classification is challenging for vision language models (VLMs), primarily because closely related classes differ only subtly. This paper introduces CascadeVLM, a cascading framework that integrates CLIP with large vision language models (LVLMs) to combine the strengths of both in fine-grained image classification. The method has two steps. First, CLIP identifies candidate classes based on its prediction confidence. Then, an LVLM makes a zero-shot or few-shot prediction restricted to these candidate classes. Empirical evaluations on four fine-grained image classification benchmarks show that CascadeVLM outperforms the individual models. For example, on the StanfordCars dataset, CascadeVLM achieves 85.6% zero-shot accuracy. An efficiency analysis reveals a trade-off between inference speed and prediction accuracy, and an error analysis indicates that most failures stem from LVLM prediction errors, even when the correct class is among the candidate options.
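
The two-step cascade described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`softmax`, `top_k_candidates`, `cascade_predict`), the `clip_scorer` and `lvlm_chooser` callables, the default `k=3`, and the fallback to CLIP's top candidate are all assumptions made for illustration. In practice, `clip_scorer` would wrap a CLIP image-text similarity call and `lvlm_chooser` would wrap a prompt to an LVLM.

```python
import math
from typing import Any, Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw similarity scores into confidences that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_candidates(
    class_names: Sequence[str], scores: Sequence[float], k: int
) -> List[str]:
    """Step 1: keep the k classes with the highest confidence."""
    confidences = softmax(scores)
    ranked = sorted(
        zip(class_names, confidences), key=lambda pair: pair[1], reverse=True
    )
    return [name for name, _ in ranked[:k]]


def cascade_predict(
    image: Any,
    class_names: Sequence[str],
    clip_scorer: Callable[[Any, Sequence[str]], Sequence[float]],
    lvlm_chooser: Callable[[Any, List[str]], str],
    k: int = 3,  # hypothetical default, not taken from the paper
) -> str:
    """Step 2: let the LVLM pick among the shortlisted candidates."""
    scores = clip_scorer(image, class_names)
    candidates = top_k_candidates(class_names, scores, k)
    choice = lvlm_chooser(image, candidates)
    # Assumption: if the LVLM answers outside the shortlist,
    # fall back to the top-ranked CLIP candidate.
    return choice if choice in candidates else candidates[0]
```

A larger `k` gives the second stage more chances to see the correct class but lengthens its prompt, which is one place where a trade-off between speed and accuracy could arise in a pipeline of this shape.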

Paper Type: long

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Contribution Types: NLP engineering experiment, Approaches to low-resource settings

Languages Studied: English
