Video-KTR: Reinforcing Video Reasoning via Key Token Attribution

ICLR 2026 Conference Submission 15586 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Video Reasoning, Modality-aware Attribution, Reinforcement Learning, Multimodal Large Language Models
TL;DR: Video-KTR applies modality-aware token-level RL for video reasoning, reinforcing only visual, temporal, and uncertain tokens. It boosts accuracy and interpretability, reaching 42.7% on Video-Holmes (above GPT-4o) with broad benchmark gains.
Abstract: Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models (MLLMs), yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection. Such approaches neglect fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose causal and temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only the union of key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results—42.7% on Video-Holmes, surpassing GPT-4o—with consistent gains on both reasoning-centric and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning.
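
The abstract describes selecting three attribution-based token sets and reinforcing only their union. Below is a minimal, hypothetical sketch of what that selection and masked token-level update could look like, based solely on the abstract: all function names, thresholds (`k_frac`, `entropy_quantile`), and the use of a clipped PPO/GRPO-style objective are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of key-token selection and a masked token-level policy loss,
# inferred from the abstract. Thresholds and the clipped objective are assumptions.
import torch


def key_token_mask(logp, logp_masked_video, logp_shuffled_frames,
                   entropy, k_frac=0.2, entropy_quantile=0.8):
    """Union of visual-aware, temporal-aware, and high-entropy tokens.

    logp:                 [T] per-token log-probs given the original video
    logp_masked_video:    [T] log-probs with the visual input counterfactually masked
    logp_shuffled_frames: [T] log-probs with the frames temporally shuffled
    entropy:              [T] per-token predictive entropy
    """
    T = logp.shape[0]
    k = max(1, int(k_frac * T))

    # Visual-aware tokens: likelihood drops most when the video is masked.
    visual_drop = logp - logp_masked_video
    visual_mask = torch.zeros(T, dtype=torch.bool)
    visual_mask[visual_drop.topk(k).indices] = True

    # Temporal-aware tokens: most sensitive to frame order.
    temporal_drop = logp - logp_shuffled_frames
    temporal_mask = torch.zeros(T, dtype=torch.bool)
    temporal_mask[temporal_drop.topk(k).indices] = True

    # High-entropy tokens: predictive uncertainty above a quantile threshold.
    entropy_mask = entropy >= entropy.quantile(entropy_quantile)

    # Reinforce only the union of the three key-token sets.
    return visual_mask | temporal_mask | entropy_mask


def masked_policy_loss(logp_new, logp_old, advantage, token_mask, clip_eps=0.2):
    """Clipped policy-gradient objective applied only to the selected key tokens."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    per_token = -torch.min(ratio * advantage, clipped * advantage)
    return (per_token * token_mask).sum() / token_mask.sum().clamp(min=1)
```

In this sketch, tokens outside the union contribute nothing to the gradient, which is one simple way to realize the "reinforce only key tokens" idea; the paper's actual selection criteria and loss may differ.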
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15586