Incentivizing Visual Thinking Cues via Reinforcement Learning for Complex Scene Reasoning, Planning and Understanding
Keywords: Thinking, Reinforcement, Reasoning, Planning, Understanding
Abstract: Modern vision-language models struggle with long-horizon, compositional tasks because they lack explicit intermediate structure and often rely on post-hoc, unverifiable rationales. We propose Reinforcement Learning with Visual Thinking Cues (RL-VTC), a training framework that incentivizes models to generate compact, verifiable intermediate artifacts—visual thinking cues—such as spatial relation graphs, coarse scene sketches, affordance heatmaps, and temporal action sketches. RL-VTC couples a cue policy (that produces VTCs) with a reasoner/planner (that solves the downstream task conditioned on the cues). A task-aware verifier computes rewards that combine (i) final task success, (ii) counterfactual utility—the causal improvement when the same policy acts with vs. without the cues, (iii) temporal consistency, and (iv) parsimony to discourage gratuitous markup. We further introduce retrospective credit assignment with advantage estimates from cue ablations and off-policy relabeling to stabilize training. Across complex-scene reasoning, embodied planning, and multi-hop understanding benchmarks, RL-VTC consistently outperforms strong supervised and chain-of-thought baselines while reducing hallucinated relations and improving sample efficiency. Human and automatic fidelity checks show that learned cues are faithful (predictive under perturbations), succinct, and actionable for planning. Ablations confirm the necessity of counterfactual utility and consistency terms. Our results demonstrate that learning to think visually—by rewarding intermediate, testable structure—yields more reliable reasoning and planning in complex scenes, offering a principled path toward RL-driven interpretability in multimodal agents.
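The abstract's reward design combines four terms (final success, counterfactual utility, temporal consistency, parsimony). A minimal illustrative sketch of such a composite reward is given below; all names, weights, and the CueRolloutStats container are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the RL-VTC composite reward described in the abstract.
# All names, weights, and signatures are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CueRolloutStats:
    success_with_cues: float      # task success when the policy conditions on the cues (0..1)
    success_without_cues: float   # same policy with cues ablated (0..1)
    consistency: float            # temporal consistency of cues across steps (0..1)
    cue_tokens: int               # size of the emitted cue markup

def vtc_reward(stats: CueRolloutStats,
               w_util: float = 1.0,
               w_cons: float = 0.5,
               w_parsimony: float = 0.01) -> float:
    """Combine (i) final success, (ii) counterfactual utility,
    (iii) temporal consistency, and (iv) a parsimony penalty."""
    counterfactual_utility = stats.success_with_cues - stats.success_without_cues
    return (stats.success_with_cues
            + w_util * counterfactual_utility
            + w_cons * stats.consistency
            - w_parsimony * stats.cue_tokens)

# Example rollout where cues causally help and stay compact:
print(vtc_reward(CueRolloutStats(0.9, 0.6, 0.8, 40)))  # -> 1.2
```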
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: true
Submission Guidelines: true
Anonymous Url: true
No Acknowledgement Section: true
Submission Number: 955