SeVA: Learning to Ask Discriminative Queries for Fine-Grained Visual Recognition

17 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: Fine-grained visual recognition, iterative reasoning, self-questioning, semantic anchors.
Abstract: Fine-grained visual recognition (FGVR) aims to distinguish categories based on subtle, localized cues. Recent methods use vision–language models to ask questions for visual hints, but typically rely on fixed templates that yield static attributes rather than adaptive, informative queries. This limits their ability to reveal the discriminative features critical to fine-grained categorization. In this work, we ask a key question: how can we ask better questions that are context-aware, targeted, and dynamically guide visual reasoning? We propose the Anchored Self-Questioning Vision Agent (SeVA), an iterative reasoning framework that combines a visual-question-answering model with two large language models acting as a Questioner and a Reasoner. Rather than extracting surface-level attributes, SeVA begins with a coarse prediction and then actively interrogates the image by generating discriminative, context-sensitive sub-questions. A Verifier highlights relevant regions, and the Reasoner integrates the accumulated evidence to refine the prediction over multiple rounds. To ensure stable and effective interaction between these components, SeVA introduces two complementary types of semantic anchors: (i) explicit anchors from prior category names that guide early attention, and (ii) implicit anchors from previous predictions that provide a language-based gradient for progressive reasoning. Experiments on standard FGVR benchmarks demonstrate the importance of asking good questions, enabling SeVA to outperform state-of-the-art methods.
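The iterative loop sketched in the abstract (coarse prediction, self-generated sub-questions, verification, and anchored refinement) can be outlined as follows. This is a minimal, hypothetical sketch, not the authors' implementation: the function names (`questioner`, `verifier`, `reasoner`) and the `rounds` parameter are placeholders for the paper's Questioner LLM, VQA-based Verifier, and Reasoner LLM, whose actual interfaces are not given in the abstract.

```python
def seva_loop(image, category_names, questioner, verifier, reasoner, rounds=3):
    """Hypothetical sketch of SeVA's anchored self-questioning loop.

    questioner(image, anchors, prediction) -> a discriminative sub-question
    verifier(image, question)              -> an answer grounded in the image
    reasoner(image, evidence, anchors)     -> a refined category prediction
    """
    # Coarse initial prediction, guided by explicit anchors (prior category names).
    prediction = reasoner(image, evidence=[], anchors=category_names)
    evidence = []
    for _ in range(rounds):
        # Questioner generates a context-sensitive sub-question, conditioned on
        # explicit anchors (category names) and the current prediction.
        question = questioner(image, anchors=category_names, prediction=prediction)
        # Verifier grounds the question in the image.
        answer = verifier(image, question)
        evidence.append((question, answer))
        # Reasoner refines the prediction; the previous prediction serves as an
        # implicit anchor (a language-based "gradient" toward the answer).
        prediction = reasoner(image, evidence=evidence, anchors=[prediction])
    return prediction


# Toy stand-ins to illustrate the control flow (no real models involved).
def toy_questioner(image, anchors, prediction):
    return f"Does the {prediction or anchors[0]} have a red crown?"

def toy_verifier(image, question):
    return "yes" if "red crown" in question else "unclear"

def toy_reasoner(image, evidence, anchors):
    # Pretend positive evidence settles on the first anchor.
    if any(ans == "yes" for _, ans in evidence):
        return "red-crowned sparrow"
    return anchors[0]
```

The key structural point the sketch captures is that each round's prediction feeds back into the next round's question generation, so the query stream adapts rather than following a fixed template.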
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 8185