Know but can't Say: Exploring the Hidden Knowledge of Large Vision-Language Models for Fine-grained Perception

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Fine-grained Perception, Hidden Knowledge, Large Vision-Language Models
Abstract: Fine-grained perception is ubiquitous in the real world, yet it remains a challenging task for Large Vision-Language Models (LVLMs), despite their remarkable generalization capability. How to enhance the fine-grained perception of LVLMs and achieve generalizable fine-grained perception has become a critical research problem. In this paper, we focus on Fine-Grained Visual Classification (FGVC), a representative fine-grained perception task. Mainstream views attribute the poor performance to the absence of relevant knowledge, such as the appearance of a specific fine-grained category, and fine-tune LVLMs on fine-grained annotated datasets. However, due to the limited scale of these datasets, such approaches risk overfitting, which degrades the generalization capability of LVLMs. We find that LVLMs are already equipped with FGVC capabilities that are not reflected in their generated responses. We refer to this phenomenon as hidden knowledge: the model knows the answer but cannot say it. The existence of hidden knowledge is verified by probing techniques on LVLMs' hidden states, which reveal a gap between the internal knowledge in the parameters and the external knowledge in the responses. Furthermore, our probing technique discovers a generalizable, domain-invariant pattern. By leveraging this pattern, we improve FGVC accuracy without using annotated data from the target domain. This improvement indicates that unleashing the hidden knowledge of LVLMs can help achieve generalizable fine-grained perception.
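The abstract verifies hidden knowledge by training probes on hidden states. As an illustration only (the paper's actual probing setup is not given here), a minimal linear probe can be sketched as follows; the synthetic features stand in for LVLM hidden states, and all names and dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for LVLM hidden states: 200 samples, 64 dims,
# with a binary fine-grained label that is linearly decodable.
n, d = 200, 64
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))          # "hidden states"
y = (X @ w_true > 0).astype(float)   # "fine-grained labels"

def train_linear_probe(X, y, lr=0.1, epochs=500):
    """Fit a logistic-regression probe on hidden states by gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train_linear_probe(X, y)
probe_acc = np.mean(((X @ w + b) > 0) == (y > 0.5))
```

If `probe_acc` substantially exceeds the accuracy of the model's generated answers on the same inputs, that gap is evidence of knowledge present in the hidden states but absent from the responses.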
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4667