Keywords: Video Understanding, Multimodal Large Language Models, Test-Time Scaling
Abstract: Current Multimodal Large Language Models (MLLMs) may struggle with tasks requiring deep logical reasoning about video content, primarily due to their feed-forward processing nature, which limits their capacity for self-correction and iterative refinement. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the Sensor monitors the MLLM's forward process and collects intermediate interpretations, such as attention drift; the Controller then determines when and how to trigger self-correction and generates feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring training or additional components. Experiments demonstrate significant improvements on complex reasoning benchmarks: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance comparable even to that of human experts. Furthermore, on other reasoning-focused benchmarks, our method shows consistent gains of 4.6% on the multiple-choice question section of MMVU and 2.4% on MMR-V, highlighting its robustness in enhancing logical reasoning for video understanding. The code will be released to support further research.
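The abstract describes a monitor-decide-refine loop at inference time. The following is a minimal, hypothetical sketch of such a loop, assuming placeholder callables (`run_mllm`, `estimate_drift`, `build_feedback`) that stand in for the frozen MLLM, the Sensor's monitoring signal, and the Controller's feedback generation; these names and thresholds are illustrative assumptions, not the authors' implementation.

```python
from typing import Any, Callable, Dict, Tuple


def cybernetic_inference(
    video: Any,
    question: str,
    run_mllm: Callable[..., Tuple[str, Dict]],          # frozen MLLM forward pass
    estimate_drift: Callable[[Dict], float],             # Sensor: e.g. an attention-drift score
    build_feedback: Callable[[str, Dict, float], str],   # Controller: feedback for the next round
    max_rounds: int = 3,
    drift_threshold: float = 0.5,
) -> str:
    """Test-time adaptive scaling: monitor the forward pass, then decide whether to re-query."""
    feedback = None
    answer = ""
    for _ in range(max_rounds):
        # MLLM Inference System: one forward pass, optionally conditioned on prior feedback.
        answer, intermediates = run_mllm(video, question, feedback=feedback)

        # Sensor: collect intermediate interpretations from the forward process.
        drift = estimate_drift(intermediates)

        # Controller: stop if the signal looks reliable; otherwise trigger self-correction.
        if drift < drift_threshold:
            return answer
        feedback = build_feedback(answer, intermediates, drift)

    return answer  # fall back to the last answer once the compute budget is exhausted
```

Because every component here operates only on inference-time signals, the sketch leaves the underlying model frozen, consistent with the training-free claim in the abstract; how the drift signal and feedback are actually computed is specific to the paper.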
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6845