KeyScore: Caption-Grounded Frame Scoring and Spatio-Temporal Clustering for Scalable Video–Language Understanding
\begin{abstract}
Selecting a compact yet informative subset of frames is crucial for efficient video understanding, but existing heuristics often overlook semantic grounding and fail to generalize across tasks.
We introduce \textbf{KeyScore}, a caption-grounded frame scoring framework that integrates three cues: semantic relevance to captions, temporal distinctiveness, and contextual drop impact.
KeyScore assigns each frame an importance score that can guide keyframe extractors or multimodal transformers, without any task-specific retraining.
We further propose \textbf{STACFP} (\textit{Spatio-Temporal Adaptive Clustering for Frame Proposals}), which adaptively partitions videos into diverse, non-redundant segments for compact and representative coverage.
Together, KeyScore and STACFP achieve up to \textbf{99\% frame reduction} over full-frame processing and over \textbf{70\% reduction} relative to 8-frame encoders, consistently outperforming them in \textbf{zero-shot} settings across benchmarks for video–language retrieval, keyframe extraction, and action classification.
Our approach enables efficient and transferable \textbf{zero-shot video understanding} across diverse domains. This is the first unified caption-grounded and spatio-temporal adaptive framework for zero-shot video understanding.
\end{abstract}
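The three cues named in the abstract suggest a weighted combination of per-frame scores. The following is an illustrative sketch only; the weights $\alpha, \beta, \gamma$, the embedding functions $\phi, \psi$, and the exact cue definitions are assumptions for exposition, not the paper's stated formulation:
\begin{equation*}
\mathrm{KeyScore}(f_i) = \alpha \underbrace{\cos\!\big(\phi(f_i), \psi(c)\big)}_{\text{semantic relevance}}
+ \beta \underbrace{d\big(\phi(f_i), \phi(f_{i-1})\big)}_{\text{temporal distinctiveness}}
+ \gamma \underbrace{\Delta\big(V,\, V \setminus \{f_i\}\big)}_{\text{contextual drop impact}}
\end{equation*}
where $c$ is the grounding caption, $V$ the full frame set, and $\Delta$ measures the change in a downstream representation when frame $f_i$ is dropped.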