Abstract: Long video understanding has emerged as a crucial capability in real-world applications
such as meeting summarization, video surveillance, educational lecture analysis, and content
moderation. However, it remains computationally prohibitive for VideoLLMs, primarily due
to two bottlenecks: 1) sequential video decoding, where converting the raw bit stream
to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling
of up to several million tokens for LLM inference, resulting in high latency and memory
use. To address these challenges, we propose QuickVideo, a system-algorithm co-design
that substantially accelerates long video understanding to support real-time downstream
applications. It comprises three key innovations: QuickCodec, a parallelized CPU-based
video decoder that achieves a 2–3× speedup by splitting videos into keyframe-aligned intervals
processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache
pruning to support more frames with less GPU memory; and an overlapping scheme
that runs CPU video decoding concurrently with GPU inference. Together, these components reduce
the time required to process a long video input by a minute, enabling fast, efficient video
understanding even on limited hardware. Experiments show that QuickVideo generalizes
across durations and sampling rates, making long video processing feasible in practice.
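
To illustrate the keyframe-aligned splitting idea behind QuickCodec, the following is a minimal sketch and not the authors' implementation. It assumes the keyframe positions are already known as a sorted list of frame indices starting at 0. The names keyframe_aligned_intervals, decode_interval, and parallel_decode are hypothetical, and decode_interval is a stand-in that returns frame indices instead of calling a real decoder.

```python
from concurrent.futures import ThreadPoolExecutor


def keyframe_aligned_intervals(keyframes, total_frames, num_workers):
    """Split [0, total_frames) into at most num_workers contiguous intervals.

    Every interval starts on a keyframe, so it can be decoded without
    reference to frames in any other interval.
    Assumption: `keyframes` is sorted and its first entry is 0.
    """
    if not keyframes or keyframes[0] != 0:
        raise ValueError("the first frame must be a keyframe")
    target = total_frames / num_workers  # ideal interval length in frames
    starts = [0]
    for k in keyframes[1:]:
        # Open a new interval at the first keyframe at or past the next
        # ideal boundary, until we have one interval per worker.
        if len(starts) < num_workers and k >= len(starts) * target:
            starts.append(k)
    ends = starts[1:] + [total_frames]
    return list(zip(starts, ends))


def decode_interval(interval):
    """Hypothetical stand-in for a decoder: returns the frame indices."""
    start, end = interval
    return list(range(start, end))


def parallel_decode(keyframes, total_frames, num_workers=4):
    """Decode all intervals concurrently and return frames in order."""
    intervals = keyframe_aligned_intervals(keyframes, total_frames, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map yields results in input order, so frames stay sorted.
        chunks = list(pool.map(decode_interval, intervals))
    return [frame for chunk in chunks for frame in chunk]
```

Because each interval begins on a keyframe, the workers share no decoding state, which is what makes the intervals safe to process concurrently. A real decoder would replace decode_interval with a seek to the interval's first keyframe followed by sequential decoding up to its end.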
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jia-Bin_Huang1
Submission Number: 5522