Keywords: vision-language models, self-correction, feedback
TL;DR: Inspired by the mixed results of self-correction in LLMs, we explore self-correction in large vision-language models.
Abstract: Enhancing the semantic grounding abilities of Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining network architectures, or modifying training recipes. In this work, we venture into an orthogonal direction and explore semantic grounding in VLMs through self-correction, without requiring in-domain data, fine-tuning, or modifications to the network architectures. Despite the concerns raised about self-correction in LLMs, we find that, if prompted and framed properly, VLMs can correct their own semantic grounding mistakes even without access to oracle feedback. We further apply the identified self-correction framework in an iterative setting, which consistently improves performance across all models investigated, by up to 8.4 accuracy points. Yet, after several rounds of feedback, strong models such as GPT-4V and GPT-4o still exhibit significant error rates, indicating promising directions for further research.
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8992