Abstract: We present a perception-in-reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected, yet often fail, to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enabling iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation. Project Page: [https://weiyana.github.io/Perception-in-Reflection](https://weiyana.github.io/Perception-in-Reflection)
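For intuition, the following is a minimal sketch of the alternating policy-critic loop described in the abstract. It is not the released RePer implementation (see the code link below); all class and method names (`PolicyModel`, `CriticModel`, `critique`, `max_rounds`) are illustrative assumptions.

```python
"""Hypothetical sketch of a dual-model reflection loop: a policy model
proposes an answer, a critic model critiques it against the image, and
the policy revises over several rounds."""

from dataclasses import dataclass


@dataclass
class Critique:
    feedback: str        # natural-language critique of the current answer
    satisfactory: bool   # critic's stopping signal


class PolicyModel:
    """Stand-in for the policy LVLM; a real system would query an LVLM here."""

    def generate(self, image, question, feedback=None):
        if feedback is None:
            return f"Initial answer for {question!r}"
        return f"Revised answer incorporating: {feedback}"


class CriticModel:
    """Stand-in for the critic LVLM that checks answers against the image."""

    def critique(self, image, question, answer):
        if answer.startswith("Revised"):
            return Critique("", True)
        return Critique("mention the object in the lower-left region", False)


def reflective_perception(image, question, policy, critic, max_rounds=3):
    """Alternate between policy generation and critic feedback until the
    critic is satisfied or the round budget is exhausted."""
    answer = policy.generate(image, question)
    for _ in range(max_rounds):
        result = critic.critique(image, question, answer)
        if result.satisfactory:
            break
        answer = policy.generate(image, question, feedback=result.feedback)
    return answer


if __name__ == "__main__":
    print(reflective_perception("img.png", "What is in the scene?",
                                PolicyModel(), CriticModel()))
```

The loop makes the paradigm's key design choice concrete: perception quality is improved not by a single forward pass but by iterating generation and critique, with the critic deciding when the answer is good enough.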
Lay Summary: AI systems that interpret images and generate text often make mistakes, such as describing objects that are not present or overlooking important visual details. These issues are especially common in models that combine vision and language, which struggle to align what they see with how they describe it.
To address this, we developed a method called RePer that helps vision-language models reflect on their own errors and improve through step-by-step corrections. RePer learns from feedback, adjusting its responses over multiple rounds, much like how people revise their thinking. This approach trains models to better align their visual focus with human attention and generate more accurate image descriptions. We also introduced a new benchmark that evaluates whether a model’s understanding of images matches how humans perceive them.
In both automated and human evaluations, RePer consistently outperforms existing models. Our work shows that adding reflection and feedback to AI systems can significantly enhance their reliability and interpretability. This brings us closer to building AI tools that see and describe the world in ways people can understand and trust.
Link To Code: https://github.com/weiyana/Perception-in-Reflection-ICML2025
Primary Area: Deep Learning->Large Language Models
Keywords: Perception, Reflection, LVLMs
Submission Number: 766