Temporal Reasoning for Vision-Language Models via Chain of Draft

ICLR 2026 Conference Submission 16606 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: chain of draft; temporal reasoning
Abstract: Large Vision-Language Models (LVLMs) such as Qwen-VL have demonstrated remarkable capabilities in understanding and reasoning about visual content, particularly static images. Their application to video reasoning, however, remains computationally expensive, incurring substantial latency and token usage when prompted with traditional Chain-of-Thought (CoT) methods. In this paper, we propose integrating the Chain of Draft (CoD) methodology with Qwen-VL for efficient video reasoning. CoD is a prompting technique that encourages models to generate concise, essential intermediate thoughts rather than verbose reasoning steps. We adapt this approach specifically for video understanding tasks and show that it matches or exceeds the accuracy of CoT while reducing token consumption by up to 78% and inference latency by up to 65%. We evaluate our approach on multiple video reasoning benchmarks, including MVBench and EgoSchema, demonstrating its effectiveness across a range of video understanding tasks. Our contributions are: (1) a novel adaptation of Chain of Draft for video reasoning tasks; (2) a comprehensive evaluation framework for video reasoning efficiency; (3) a theoretical analysis providing time-complexity guarantees; and (4) empirical evidence of significant computational savings without sacrificing accuracy. This work has important implications for deploying efficient video reasoning in resource-constrained environments and real-time applications.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16606
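
As a rough illustration of the contrast the abstract draws between CoT and CoD prompting for video QA, consider the minimal Python sketch below. The instruction wording, the five-word draft budget, the `build_messages` helper, and the frame-plus-text message structure are illustrative assumptions for this sketch, not the authors' actual prompts or pipeline; the message layout only loosely mirrors the chat format used by multimodal models such as Qwen-VL.

```python
# Minimal sketch: Chain-of-Thought vs. Chain-of-Draft prompting for video QA.
# All prompt text and the five-word draft budget are assumptions for
# illustration; the paper's exact prompts are not shown in the abstract.

COT_INSTRUCTION = (
    "Think step by step to answer the question about the video. "
    "Explain your reasoning in full sentences, then give the final answer."
)

COD_INSTRUCTION = (
    "Think step by step, but keep each intermediate step to a short draft "
    "of at most five words. Return the final answer after '####'."
)


def build_messages(question: str, frame_paths: list[str], instruction: str) -> list[dict]:
    """Assemble a chat-style message pairing sampled video frames with the
    question and the chosen prompting instruction (hypothetical helper)."""
    content = [{"type": "image", "image": path} for path in frame_paths]
    content.append({"type": "text", "text": f"{instruction}\n\nQuestion: {question}"})
    return [{"role": "user", "content": content}]


if __name__ == "__main__":
    # Uniformly sampled frames standing in for the video input.
    frames = ["frame_000.jpg", "frame_016.jpg", "frame_032.jpg"]
    question = "What does the person pick up after opening the drawer?"
    for name, instruction in [("CoT", COT_INSTRUCTION), ("CoD", COD_INSTRUCTION)]:
        messages = build_messages(question, frames, instruction)
        print(f"{name} prompt: {messages[0]['content'][-1]['text'][:60]}...")
```

Under this framing, the efficiency gains the abstract reports would come from the CoD instruction capping each intermediate step at a few words, so the model emits far fewer reasoning tokens per query than with the verbose CoT instruction.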