VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Video

ICLR 2026 Conference Submission 3317 Authors

09 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Reasoning, Video Question Answering, Mathematical Understanding, Temporal Reasoning, Visual Grounding
TL;DR: A benchmark for evaluating mathematical reasoning in video MLLMs.
Abstract: Mathematical reasoning in real-world videos poses a fundamentally different challenge from reasoning over static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues that are often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, with videos ranging from 10 seconds to over 1 hour. To ensure high quality, graduate-level experts contributed over 920 hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, which involves multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we establish an evaluation framework for models that must reason, rather than merely perceive, and jointly ground concepts across visual, audio, and textual modalities in temporally extended mathematical problem settings.
Primary Area: datasets and benchmarks
Submission Number: 3317