Rethinking Scale: How Multi-Agent Collaboration Enables Smaller Models to Rival GPT-4 in Video Understanding

ICLR 2026 Conference Submission 15188 Authors

19 Sept 2025 (modified: 08 Oct 2025), License: CC BY 4.0
Keywords: Video Understanding, Large Language Model, Agent
Abstract: The rapid development of large language models (LLMs) has brought new perspectives to the field of video understanding. However, existing methods often rely on large-scale proprietary models such as GPT-4 to achieve competitive performance. This paper challenges the notion that scale is the primary driver of capability by introducing RIVAL, a framework demonstrating how multi-agent collaboration enables smaller open-source models (72B parameters or fewer) to rival their large-scale counterparts. RIVAL consists of two key components: a Multi-stage ReAct Planner (MSRP) for structured stepwise reasoning and Multi-agent Debate Refinement (MADR) for collaborative answer generation. MSRP enhances instruction-following through precise control, while MADR improves answer quality via multi-perspective debate. Using a 72B model, our framework sets a new state of the art on the EgoSchema subset with 66.8% accuracy, surpassing prior GPT-4-based methods by 6.6%. Furthermore, even smaller open-source models (0.6B to 32B) across the Qwen 2.5 and Qwen 3 series achieve competitive performance with RIVAL, which also performs competitively on the NExT-QA benchmark. Highlighting its efficiency, RIVAL can process over 28 hours of continuous video input with limited computational resources.
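The abstract does not include implementation details, so the following is a minimal illustrative sketch of the kind of debate loop MADR describes: several agents answer independently from video-derived evidence, critique one another's answers over a few rounds, and a final answer is aggregated. All names here (`debate_refine`, `DebateConfig`, the `Agent` callable interface) are hypothetical assumptions for illustration, not from the paper.

```python
# Hypothetical sketch of a multi-agent debate refinement loop in the spirit
# of MADR; the paper's actual prompts, agent roles, and stopping criteria
# are not specified in this abstract.
from dataclasses import dataclass
from typing import Callable, List

# An "agent" is any callable mapping a prompt to a text completion,
# e.g. a thin wrapper around a local Qwen 2.5-72B inference endpoint.
Agent = Callable[[str], str]

@dataclass
class DebateConfig:
    num_agents: int = 3   # number of debating perspectives
    rounds: int = 2       # debate rounds after the initial answers

def debate_refine(agents: List[Agent], question: str, evidence: str,
                  cfg: DebateConfig = DebateConfig()) -> str:
    """Collect independent answers, let agents critique each other, then vote."""
    pool = agents[:cfg.num_agents]

    # Round 0: each agent answers independently from the video-derived evidence.
    answers = [agent(f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:")
               for agent in pool]

    # Debate rounds: each agent sees the others' answers and may revise its own.
    for _ in range(cfg.rounds):
        revised = []
        for i, agent in enumerate(pool):
            others = "\n".join(f"- {a}" for j, a in enumerate(answers) if j != i)
            prompt = (f"Evidence:\n{evidence}\n\nQuestion: {question}\n"
                      f"Other agents answered:\n{others}\n"
                      f"Your previous answer: {answers[i]}\n"
                      f"Critique the other answers, then state your final answer:")
            revised.append(agent(prompt))
        answers = revised

    # Aggregate by majority vote over the final answers (ties broken arbitrarily).
    return max(set(answers), key=answers.count)
```

Majority voting is only one plausible aggregation; the paper's MADR could equally end with a judge agent or a structured consensus step.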
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15188