MORSE: A Suite of Programmatically Controllable Multimodal Reasoning Environments with Steerable Difficulty
Keywords: Multi-modal reasoning, Video understanding
Abstract: Despite rapid progress in vision-language models, current multimodal reasoning \textit{development pipelines} are limited by static training imagery, narrow task diversity, and benchmark saturation. We present \textbf{MORSE} (Multimodal Reasoning Suite), a programmatically controlled collection of \textit{video} reasoning environments with \textit{steerable difficulty} and \textit{verifiable reasoning traces and answers}. The suite comprises: (i) \textbf{MORSE-}${\infty}$, a simulator that produces unlimited, difficulty-steerable instances with reasoning traces; (ii) \textbf{MORSE-500}, a curated benchmark of 500 challenging videos covering six complementary reasoning categories and designed to retain headroom as models improve; and (iii) \textbf{MORSE-Agent}, which automates generation and curation to reduce human effort over time. Instances are produced via deterministic Python scripts (Manim, Matplotlib, MoviePy), generative video models, and curated real footage, exposing explicit controls over visual complexity, distractors, and temporal span. On \textbf{MORSE-500}, the strongest state-of-the-art system achieves 23.6\%, well below human performance at 55.4\%, highlighting persistent deficits in the abstract and planning categories. We release code, data, seeds, and the evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.
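As a rough illustration of what "deterministic scripts with steerable difficulty and verifiable answers" can mean in practice, the sketch below generates a toy counting task from a seed and a small set of difficulty knobs. It is not the MORSE code: the names `Difficulty` and `make_instance`, the task itself, and the mapping of knobs to distractors and temporal span are all hypothetical, and the sketch only produces an event script, not rendered video.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Difficulty:
    """Hypothetical difficulty knobs (not the MORSE parameter set)."""
    num_distractors: int  # number of distinct distractor event kinds
    num_frames: int       # temporal span of the clip, in frames


def make_instance(seed: int, difficulty: Difficulty) -> dict:
    """Build one toy instance; the same seed and knobs give the same instance."""
    rng = random.Random(seed)  # local RNG so generation is reproducible
    events = []
    for _ in range(difficulty.num_frames):
        # Kind 0 is the target event; kinds 1..num_distractors are distractors.
        kind = rng.randrange(difficulty.num_distractors + 1)
        events.append("target" if kind == 0 else f"distractor_{kind}")

    # The answer and trace are derived from the script, so they can be checked.
    hits = [i for i, event in enumerate(events) if event == "target"]
    return {
        "events": events,
        "question": "In how many frames does the target event occur?",
        "answer": len(hits),
        "trace": [f"frame {i}: target event" for i in hits],
    }


easy = make_instance(seed=0, difficulty=Difficulty(num_distractors=1, num_frames=8))
hard = make_instance(seed=0, difficulty=Difficulty(num_distractors=5, num_frames=64))
```

Because the answer is computed from the same script that would drive rendering, raising `num_distractors` or `num_frames` makes the task harder without requiring any manual relabeling.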
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 10104