Reinforced Visual Perception with Tools

ACL ARR 2025 May Submission2814 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose \method to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of seven visual tools. Our exploratory results across models ranging from 3B to 7B parameters show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and BLINK-Hard, significantly outperforming supervised and text-based RL finetuning baselines. We hope our exploration of RL-based visual tool usage can bring insights to the community.
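The abstract's core training signal is GRPO, which scores each sampled rollout against the other rollouts in its group rather than against a learned value function. As a minimal sketch of that group-relative advantage step (the paper's own reward design and tool-use rollouts are not specified here, so this only illustrates the standard GRPO normalization):

```python
import statistics

def grpo_advantages(rewards):
    """Compute group-relative advantages as in GRPO:
    each rollout's reward is normalized by the mean and
    standard deviation of its sampling group."""
    mean = statistics.mean(rewards)
    # Guard against a zero std when all rewards in the group are equal.
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# Example: four rollouts for one prompt, binary task reward.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts rewarded above the group mean receive positive advantages and are reinforced; below-mean rollouts are pushed down, with no critic network required.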
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: RL, Tool-usage, Multimodal, Reasoning
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2814