VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

Published: 2024, Last Modified: 09 Jan 2026, Venue: CVPR 2024, License: CC BY-SA 4.0

Abstract: Recent Large Language Models (LLMs) have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models (LMMs) typically treat videos as predetermined clips, rendering them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time dialogue within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up interactive chat in real-world video streams. With our LIVE framework, we develop a simplified model called VideoLLM-online and demonstrate its significant advantages in processing streaming videos. For instance, our VideoLLM-online-7B model can operate at over 10 FPS on an A100 GPU for a 5-minute video clip from Ego4D narration. Moreover, VideoLLM-online also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at showlab.github.io/videollm-online.
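
To make item (2) of the abstract more concrete, the following is a minimal sketch of how timestamped offline annotations might be rearranged into a frame-interleaved dialogue. It is an illustration only and not the authors' data pipeline: the `Annotation` class, the `to_streaming_dialogue` function, the `("frame", t)` / `("assistant", text)` turn format, and the 2 FPS sampling rate are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical offline annotation: a caption tied to a timestamp in seconds."""
    time: float
    text: str


def to_streaming_dialogue(annotations, duration, fps=2.0):
    """Interleave sampled frames with replies placed at the first frame
    sampled at or after each annotation's timestamp (an assumed policy)."""
    num_frames = int(duration * fps)

    # Group annotation texts by the index of the frame that should trigger them.
    replies = defaultdict(list)
    for ann in sorted(annotations, key=lambda a: a.time):
        replies[math.ceil(ann.time * fps)].append(ann.text)

    turns = []
    for i in range(num_frames):
        turns.append(("frame", i / fps))
        # Frames with no annotation get no reply: the model stays silent there.
        for text in replies.get(i, []):
            turns.append(("assistant", text))
    return turns


if __name__ == "__main__":
    example = [Annotation(1.0, "opens the fridge"), Annotation(2.6, "pours milk")]
    for turn in to_streaming_dialogue(example, duration=4.0, fps=2.0):
        print(turn)
```

In this sketch, most frames are followed by no reply at all, which reflects the general idea of a dialogue that is temporally aligned with the stream rather than attached to a whole clip. Annotations whose timestamp falls after the last sampled frame are simply dropped here.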