Abstract: Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this enhanced reasoning ability often comes with increased hallucination: as generations grow longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains reduce focus on visual inputs, which contributes to hallucination. To study this phenomenon systematically, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model maintains visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark spanning diverse multimodal tasks, designed to jointly assess the balance between reasoning ability and hallucination. Our findings indicate that larger models generally achieve a better balance between reasoning and perception, and that this balance is shaped more by the types and domains of the training data than by its volume. These insights underscore the importance of evaluation frameworks that consider both reasoning quality and perceptual reliability.
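
For intuition, the sketch below shows one way a metric in the spirit of RH-AUC could be computed, assuming it is the area under the curve of perception accuracy against (normalized) reasoning length. The function name, normalization choice, and example numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rh_auc(reasoning_lengths, perception_accuracies):
    """Illustrative RH-AUC-style score: area under the perception-accuracy
    vs. reasoning-length curve, with lengths normalized to [0, 1].
    The exact definition is given in the paper; this is only a plausible
    reading of "accuracy as a function of reasoning length"."""
    lengths = np.asarray(reasoning_lengths, dtype=float)
    accs = np.asarray(perception_accuracies, dtype=float)

    # Sort points by reasoning length so the curve is well defined.
    order = np.argsort(lengths)
    lengths, accs = lengths[order], accs[order]

    # Normalize lengths to [0, 1] so scores are comparable across models.
    span = lengths[-1] - lengths[0]
    if span == 0:
        return float(accs.mean())
    x = (lengths - lengths[0]) / span

    # Trapezoidal rule: a higher area means perception accuracy is better
    # preserved as the reasoning chain grows longer.
    return float(0.5 * np.sum((accs[1:] + accs[:-1]) * np.diff(x)))


# Hypothetical numbers: perception accuracy drifting down as chains grow.
print(rh_auc([100, 400, 800, 1600], [0.82, 0.78, 0.71, 0.63]))
```

Under this reading, a model whose perception accuracy stays flat as reasoning length grows scores close to its base accuracy, while a model that degrades sharply scores lower, which matches the balance the abstract describes.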