Concept Mediation Enables Robust Fine-Grained Visual Understanding

TMLR Paper9521 Authors

05 Jun 2026 (modified: 10 Jun 2026), Under review for TMLR, CC BY 4.0
Abstract: Large vision-language models exhibit strong general multimodal understanding, yet training-free prompting strategies often fail on fine-grained visual recognition, where correct predictions depend on subtle and localized visual attributes. Existing approaches such as chain-of-thought reasoning and in-context learning tend to produce fluent explanations or contextual cues without reliably grounding decisions in discriminative visual evidence. To address this issue, we introduce Concept-Mediated In-Context Learning (CM-ICL), a training-free prompting strategy that first extracts visual attribute concepts from the input image and then uses them as structured context for classification. Without training the model, CM-ICL provides an explicit intermediate representation that re-expresses image-derived cues for fine-grained decision making. To evaluate the extracted concepts without manual concept annotations, we combine promptable-segmentation-based perceptual grounding metrics with task-coupled diagnostics that examine how visual localizability relates to downstream prediction behavior. Experiments on six fine-grained datasets show that CM-ICL improves accuracy over training-free approaches, produces more concise and visually localizable concepts, and substantially reduces generation failures. The results demonstrate that concept mediation provides an effective and interpretable route for training-free fine-grained visual recognition.
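The two-stage idea described in the abstract (extract visual attribute concepts, then classify with those concepts as structured context) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give prompts or interfaces, so the `vlm` callable, the prompt wording, the `max_concepts` limit, and all function names below are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: (image reference, text prompt) -> generated text.
VLM = Callable[[str, str], str]


def extract_concepts(vlm: VLM, image: str, max_concepts: int = 5) -> List[str]:
    """Stage 1: ask the model for short visual attributes of the image."""
    prompt = (
        f"List up to {max_concepts} short visual attributes visible in the "
        "image, one per line."
    )
    raw = vlm(image, prompt)
    concepts: List[str] = []
    for line in raw.splitlines():
        item = line.strip().lstrip("-* ").strip()  # drop list bullets
        if item and item not in concepts:  # skip blanks and duplicates
            concepts.append(item)
    return concepts[:max_concepts]


def build_classification_prompt(concepts: Sequence[str], labels: Sequence[str]) -> str:
    """Stage 2 prompt: extracted concepts are passed back as structured context."""
    concept_block = "\n".join(f"- {c}" for c in concepts)
    label_block = ", ".join(labels)
    return (
        "Observed visual attributes:\n"
        f"{concept_block}\n"
        f"Candidate classes: {label_block}\n"
        "Answer with exactly one candidate class."
    )


def classify(
    vlm: VLM, image: str, labels: Sequence[str], max_concepts: int = 5
) -> Tuple[Optional[str], List[str]]:
    """Run both stages; return (label or None, extracted concepts)."""
    concepts = extract_concepts(vlm, image, max_concepts)
    answer = vlm(image, build_classification_prompt(concepts, labels)).lower()
    # Check longer labels first so "Scarlet Tanager" wins over "Tanager".
    for label in sorted(labels, key=len, reverse=True):
        if label.lower() in answer:
            return label, concepts
    return None, concepts  # no candidate matched: treat as a generation failure
```

Returning `None` when no candidate label appears in the output is one simple way to count generation failures separately from wrong predictions; how the paper defines and measures such failures is not stated in the abstract.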
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Candace_Ross1
Submission Number: 9521