Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck

Yuwen Tan; Yuan Qing; Boqing Gong

07 May 2025 (modified: 29 Oct 2025). Submitted to NeurIPS 2025. License: CC BY 4.0

Keywords: Vision Large Language Model, Hierarchical Visual Understanding

Abstract: This paper reveals that many state-of-the-art large language models (LLMs) lack hierarchical knowledge about our visual world and are unaware of even well-established biological taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual understanding (e.g., recognizing $\texttt{Anemone Fish}$ but not $\texttt{Vertebrate}$). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM on our VQA tasks reaffirms the LLMs' bottleneck effect to some extent, because the finetuning improves the LLM's hierarchical consistency more than the vision LLM's. We conjecture that vision LLMs cannot be made to understand visual concepts in a fully hierarchical way until LLMs possess the corresponding taxonomy knowledge.

Supplementary Material: zip

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 9515
