Keywords: Vision Large Language Model, Hierarchical Visual Understanding
Abstract: This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world and are unaware of even well-established biological taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual understanding (e.g., recognizing $\texttt{Anemone Fish}$ but not $\texttt{Vertebrate}$). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM on our VQA tasks reaffirms the LLMs' bottleneck effect because the tasks improve the hierarchical consistency of the LLMs more than that of the vision LLMs. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until the underlying LLMs possess the corresponding taxonomy knowledge. Code: https://shorturl.at/sLZol.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12120