SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

Published: 01 Jan 2024, Last Modified: 14 May 2025, CVPR 2024, CC BY-SA 4.0
Abstract: Recent trends in Large Vision Language Models (LVLMs) research have been increasingly focusing on advancing beyond general image understanding towards more nuanced, object-level referential comprehension. In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process. This capability significantly mirrors the precision and reliability of fine-grained visual-language understanding. Our findings reveal that the self-consistency level of existing LVLMs falls short of expectations, posing limitations on their practical applicability and potential. To address this gap, we introduce a novel fine-tuning paradigm named Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a cyclic describer-locator system. This paradigm is not only data-efficient but also exhibits generalizability across multiple LVLMs. Through extensive experiments, we demonstrate that SC-Tune significantly elevates performance across a spectrum of object-level vision-language benchmarks and maintains competitive or improved performance on image-level vision-language benchmarks. Both our model and code will be publicly available at https://github.com/ivattyue/SC-Tune.
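The closed-loop notion of self-consistency described in the abstract (caption an object, then re-identify it from that caption) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the `describe` and `locate` callables, the `(x1, y1, x2, y2)` box format, and the IoU threshold of 0.5 are all assumptions introduced here for illustration.

```python
from typing import Any, Callable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def self_consistency_rate(
    image: Any,
    boxes: Sequence[Box],
    describe: Callable[[Any, Box], str],   # hypothetical: box -> caption
    locate: Callable[[Any, str], Box],     # hypothetical: caption -> box
    iou_threshold: float = 0.5,            # assumed threshold, not from the paper
) -> float:
    """Fraction of objects re-identified from the model's own caption."""
    if not boxes:
        return 0.0
    hits = 0
    for box in boxes:
        caption = describe(image, box)      # describer step
        predicted = locate(image, caption)  # locator step closes the loop
        if iou(box, predicted) >= iou_threshold:
            hits += 1
    return hits / len(boxes)
```

In practice both callables would wrap the same LVLM with different prompts; here they are left abstract so the loop itself is the only thing being shown.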