Keywords: LVLMs, Fine-Grained Visual Recognition, Training-Free paradigm
Abstract: Recent advances in Large Vision–Language Models (LVLMs) have enabled training-free Fine-Grained Visual Recognition (FGVR). However, effectively exploiting LVLMs for FGVR remains challenging due to the inherent visual ambiguity of subordinate-level categories.
Existing methods predominantly adopt either retrieval-oriented or reasoning-oriented paradigms to tackle this challenge, but both are constrained by two fundamental limitations:
(1) They apply a uniform inference pipeline to every sample regardless of recognition difficulty, leading to suboptimal accuracy and efficiency;
(2) They lack mechanisms to consolidate and reuse error-specific experience, and thus fail repeatedly on similar challenging cases.
To address these limitations, we propose \textbf{SARE},
a \underline{\textbf{S}}ample-wise \underline{\textbf{A}}daptive \underline{\textbf{RE}}asoning framework for training-free FGVR.
Specifically, SARE adopts a cascaded design that combines fast candidate retrieval with fine-grained reasoning, invoking the latter only when necessary. During reasoning, SARE incorporates a self-reflective experience mechanism that distills past failures into transferable discriminative guidance at inference time, without any parameter updates.
Extensive experiments across 14 datasets demonstrate that SARE achieves state-of-the-art performance while substantially reducing computational overhead.
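A minimal sketch of the cascaded, sample-wise adaptive inference that the abstract describes, assuming a hypothetical interface: `adaptive_recognize`, `retrieve`, `reason`, `experience_bank`, and the threshold `tau` are illustrative names, not the paper's actual API.

```python
from typing import Callable, Dict, List, Tuple

def adaptive_recognize(
    image,
    categories: List[str],
    retrieve: Callable[..., List[Tuple[str, float]]],  # fast candidate retrieval
    reason: Callable[..., str],                        # LVLM fine-grained reasoning
    experience_bank: Dict[Tuple[str, ...], str],       # error-specific hints from past failures
    tau: float = 0.85,                                 # confidence threshold (assumed value)
) -> str:
    """Label an image, escalating to fine-grained reasoning only for hard samples."""
    # Stage 1: fast retrieval of candidate labels, sorted by descending confidence.
    candidates = retrieve(image, categories)
    top_label, top_score = candidates[0]

    # Easy sample: retrieval is confident enough, so skip reasoning entirely.
    if top_score >= tau:
        return top_label

    # Hard sample: look up experience consolidated from past failures on a
    # similar confusable candidate set, then reason over the shortlist with
    # that guidance; no parameter updates are involved.
    key = tuple(label for label, _ in candidates[:5])
    guidance = experience_bank.get(key, "")
    return reason(image, candidates, guidance)
```

This is only an interpretation of the cascade's control flow; the paper's actual retrieval scorer, difficulty criterion, and experience-matching scheme may differ.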
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, vision question answering, cross-modal application
Languages Studied: English
Submission Number: 1386