Improving LLM Video Understanding with 16 Frames Per Second

Published: 01 May 2025, Last Modified: 18 Jun 2025, Venue: ICML 2025 poster, License: CC BY 4.0
Abstract: Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of $\leqslant$2 frames per second (FPS), leading to critical visual information loss. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (*e.g.*, basketball, football, gymnastics, and diving), outperforming SOTA proprietary visual models like GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. We will release the source code, model checkpoints, and data at [https://github.com/bytedance/F-16](https://github.com/bytedance/F-16).
Lay Summary: Human vision naturally processes continuous motion, but most AI video models only analyze a few still frames per second, missing important visual details. To address this, we developed F-16, a new AI model that can understand videos at a much higher frame rate—16 frames per second. F-16 compresses visual information from each second of video, allowing it to capture motion and key details more effectively without needing much more computing power. Tests show that F-16 performs better than previous models on various video understanding tasks, including general and detailed benchmarks, as well as complex activities like sports. It even beats leading commercial models like GPT-4o and Gemini 1.5 Pro in analyzing fast-paced sports like basketball and diving.
Primary Area: Applications->Computer Vision
Keywords: Multi-modal large language models, high-frame-rate video understanding, video LLM
Submission Number: 5923