TreeReasoner: Reinforcing Tool-Augmented Tree-of-Videos Reasoning

Published: 12 May 2026, Last Modified: 12 May 2026, Venue: 2nd ViSCALE @ CVPR 2026 Oral, License: CC BY 4.0
Keywords: Video Understanding, Reasoning, Tool Learning
Abstract: We present TreeReasoner, a tool-augmented, tree-structured reasoning framework that recasts long-video understanding as an active hypothesis-verification problem over a vast visual search space. The model maintains multiple parallel reasoning paths to explore the temporal dimension systematically. Guided by intermediate hypotheses, it invokes frame-level tools such as temporal zooming, temporal jumping, and sliding to incrementally search for a minimal yet sufficient chain of evidence. After a supervised fine-tuning warmup, the entire framework is trained end-to-end with Tree-of-Tool Relative Policy Optimization (ToT-RPO). It achieves superior video-understanding accuracy while decoding far fewer frames than existing methods, and it exhibits interpretable temporal-localization and causal-verification behaviors. Experiments on six long-video reasoning benchmarks show that TreeReasoner consistently outperforms both standard input-output (IO) and naive tool-calling baselines. Transfer experiments targeting hallucination further confirm that the model generalizes and has a reduced tendency to hallucinate. Together, these experiments validate the stability and efficiency of TreeReasoner in complex temporal scenarios.
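The abstract names three frame-level tools (temporal zooming, temporal jumping, and sliding) but does not give their signatures. As a rough illustration only, the sketch below models each tool as an operation on a temporal window from which a fixed number of frames is sampled. The names `Window`, `zoom`, `jump`, `slide`, and `clamp`, and all of their parameters, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical view onto a video: a time span plus a frame budget."""
    start: float      # seconds
    end: float        # seconds
    num_frames: int   # frames decoded from this span

    def timestamps(self):
        # Sample one frame at the center of each equal-length sub-interval.
        step = (self.end - self.start) / self.num_frames
        return [self.start + step * (i + 0.5) for i in range(self.num_frames)]


def clamp(start, end, duration):
    # Shift the span so it lies inside [0, duration], preserving its length.
    length = min(end - start, duration)
    start = max(0.0, min(start, duration - length))
    return start, start + length


def zoom(w, factor, duration):
    # Assumed semantics: narrow the span around its center by `factor`,
    # keeping the frame budget, so temporal resolution increases.
    center = (w.start + w.end) / 2
    half = (w.end - w.start) / (2 * factor)
    s, e = clamp(center - half, center + half, duration)
    return Window(s, e, w.num_frames)


def jump(w, center, duration):
    # Assumed semantics: re-center a same-length span on a new timestamp.
    half = (w.end - w.start) / 2
    s, e = clamp(center - half, center + half, duration)
    return Window(s, e, w.num_frames)


def slide(w, offset, duration):
    # Assumed semantics: shift the span forward or backward by `offset` seconds.
    s, e = clamp(w.start + offset, w.end + offset, duration)
    return Window(s, e, w.num_frames)
```

Under these assumptions, each reasoning path in the tree would hold its own `Window` and decode only `num_frames` frames per tool call, which is one plausible way a search could use far fewer frames than dense sampling. How TreeReasoner actually parameterizes its tools is not stated in the abstract.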
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 3