Keywords: Explainable AI, Hierarchical reasoning, Fine-grained recognition
Abstract: Fine-grained visual recognition (FGVR) requires not only high accuracy but also human-aligned interpretability, particularly in safety-critical applications. While human cognition naturally follows a coarse-to-fine reasoning process—rapid holistic categorization at the coarse-grained level followed by attention to local details at the fine-grained level—existing post-hoc and ante-hoc interpretability methods fall short in capturing this hierarchy automatically. To address this gap, we propose Bi-HiR, a novel Bidirectional Hierarchical Reasoning framework that emulates human-like cognition by integrating top-down semantic reasoning with bottom-up prototype-based explanations. Specifically, Bi-HiR: (1) leverages large language model (LLM)-derived semantic priors to construct coarse-to-fine hierarchies without manual annotations; (2) introduces a joint optimization strategy in which top-down priors guide bottom-up prototype learning across semantic levels; and (3) produces interpretable, step-wise visual and semantic explanations. Experiments on six FGVR benchmarks demonstrate that Bi-HiR achieves performance competitive with the state of the art and exhibits superior zero-shot generalization. The results also show that Bi-HiR's interpretability improves human trust and facilitates model error diagnosis. Code is publicly available at https://github.com/duduududamax/Bi-HiR.
Primary Area: interpretability and explainable AI
Submission Number: 11535