KeyScore: Caption-Grounded Frame Scoring and Spatio-Temporal Clustering for Scalable Video–Language Understanding
\begin{abstract}
Selecting a compact yet informative subset of frames is crucial for efficient video understanding, but existing heuristics often overlook semantic grounding and fail to generalize across tasks.
We introduce \textbf{KeyScore}, a caption-grounded frame scoring framework that integrates three cues: semantic relevance to captions, temporal distinctiveness, and contextual drop impact.
KeyScore assigns each frame an importance score that can guide keyframe extractors or multimodal transformers, without any task-specific retraining.
We further propose \textbf{STACFP} (\textit{Spatio-Temporal Adaptive Clustering for Frame Proposals}), which adaptively partitions videos into diverse, non-redundant segments for compact and representative coverage.
Together, KeyScore and STACFP achieve up to \textbf{99\% frame reduction} over full-frame processing and over \textbf{70\% reduction} relative to 8-frame encoders, consistently outperforming them in \textbf{zero-shot} settings across benchmarks for video–language retrieval, keyframe extraction, and action classification.
Our approach enables efficient and transferable \textbf{zero-shot video understanding} across diverse domains. This is the first unified caption-grounded and spatio-temporal adaptive framework for zero-shot video understanding.
\end{abstract}
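The three cues named in the abstract suggest a weighted combination of per-frame scores. The following is an illustrative sketch only; the weights $\alpha, \beta, \gamma$, the embedding functions $\phi, \psi$, and the exact cue definitions are assumptions for exposition, not the paper's stated formulation:
\begin{equation*}
\mathrm{KeyScore}(f_i) = \alpha \underbrace{\cos\!\big(\phi(f_i), \psi(c)\big)}_{\text{semantic relevance}}
+ \beta \underbrace{d\big(\phi(f_i), \phi(f_{i-1})\big)}_{\text{temporal distinctiveness}}
+ \gamma \underbrace{\Delta\big(V,\, V \setminus \{f_i\}\big)}_{\text{contextual drop impact}}
\end{equation*}
where $c$ is the grounding caption, $V$ the full frame set, and $\Delta$ measures the change in a downstream representation when frame $f_i$ is dropped.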