Apollo: An Exploration of Video Understanding in Large Multimodal Models

Published: 01 Jan 2025 · Last Modified: 13 Nov 2025 · CVPR 2025 · CC BY-SA 4.0
Abstract: Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), what drives their video perception remains poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, together with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) transfer effectively to larger models. Leveraging these insights, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models efficiently process videos longer than one hour, with the 3B-parameter variant outperforming most existing 7B models. Apollo-7B is state-of-the-art among 7B LMMs, achieving 70.9 on MLVU and 63.3 on Video-MME.