Improving Thinking Process in Visual Grounding via Free Thinking Rewards

Published: 28 Jan 2026 · Last Modified: 30 Jan 2026 · OpenReview Archive Direct Upload · Everyone · CC BY 4.0
Abstract: We study egocentric visual intention grounding, where an assistant must infer and localize the object implied by a first-person view and an intention sentence that never names the object explicitly. Existing approaches either use a two-stage reasoning-then-grounding pipeline or apply reinforcement learning (RL) to train think-then-answer VLMs, but both optimize only IoU-based rewards, which invites reward hacking: box accuracy improves while reasoning quality is neglected. We introduce a label-free thinking-process reward that requires neither human chain-of-thought labels nor teacher models; it scores each sampled reasoning trace by how much it increases the likelihood of the correct answer, favoring reasoning that genuinely supports the prediction. We also propose a data-filtering strategy that uses rollout error rate and reward variance to select informative easy-to-medium samples for RL. Together, these form a general recipe for process-aware RL finetuning of vision-language assistants for egocentric intention grounding. Our method sets a new state of the art on EgoIntention, improving Precision@0.5 by +3.2 (3B) and +2.1 (7B) over strong Qwen2.5-VL baselines, and generalizes zero-shot to RefEgo-Int with gains of +10.2 (3B) and +7.5 (7B).
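The core idea of the thinking-process reward can be sketched as a likelihood-gain score: reward a reasoning trace by how much conditioning on it raises the model's log-likelihood of the correct answer. The sketch below is a minimal illustration, not the paper's implementation; `answer_log_likelihood` is a hypothetical stand-in for the VLM's scoring of log p(answer | context), replaced here by a toy word-overlap heuristic so the example runs on its own.

```python
# Minimal sketch of a label-free thinking-process reward.
# ASSUMPTION: answer_log_likelihood is a toy stand-in; in the actual
# method this would be the policy VLM's log p(answer | image, intention,
# [thinking]), which we cannot reproduce here.

def answer_log_likelihood(answer: str, context: str) -> float:
    """Toy proxy: more word overlap between context and answer
    -> higher (less negative) log-likelihood-like score."""
    ctx_words = set(context.lower().split())
    ans_words = set(answer.lower().split())
    overlap = len(ctx_words & ans_words)
    return -1.0 / (1.0 + overlap)

def thinking_reward(question: str, thinking: str, answer: str) -> float:
    """Score a sampled reasoning trace by the likelihood gain it gives
    to the correct answer (positive = the trace genuinely helped)."""
    with_thinking = answer_log_likelihood(answer, question + " " + thinking)
    without_thinking = answer_log_likelihood(answer, question)
    return with_thinking - without_thinking
```

A trace that actually narrows the model toward the ground-truth object earns a positive reward, while an irrelevant trace earns zero (or a negative value if it misleads the model), which is the property that discourages reward hacking on box accuracy alone.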