Less is More: Label-Guided Efficient Summarization of Procedural Videos

Shreya Rajpal; Michal Golovanevsky; Carsten Eickhoff

Published: 21 Jun 2026, Last Modified: 21 Jun 2026

Venue: ACL-SELVA 2026 Poster

Readers: Everyone

License: CC BY 4.0

Keywords: Efficient Multimodal Learning; Video Summarization; Vision-Language Models; Frame Selection; Sustainable AI; Efficient Inference; Procedural Video Understanding; Zero-shot Video Understanding; Semantic Frame Filtering; Long-form Video Processing

Abstract: Video summarization transforms long videos into concise representations that are easier to document and analyze, especially in high-stakes domains such as surgical training. However, summarizing long-form videos often requires dense frame processing or supervised training pipelines, which can be computationally expensive and may still miss important procedural content. We present PRISM (Procedural Representation via Integrated Semantic and Multimodal Analysis), a zero-shot, training-free framework that summarizes procedural videos efficiently. PRISM selects fewer than 5% of video frames while retaining over 84% of the semantic content and improving over the baselines by up to 7.5%. Rather than exhaustively processing frames, PRISM uses lightweight visual filtering and dynamically generated procedural labels as semantic anchors to select meaningful frame-label groups. This selective inference design preserves key actions, transitions, and contextual details while reducing the number of visual inputs passed to downstream vision-language captioning stages. We evaluate PRISM on YouCook2 and ActivityNet Captions, with additional studies on keyframe selection benchmarks and surgical video datasets. Across procedural and domain-specific video tasks, PRISM achieves strong semantic alignment and precision, suggesting that efficient multimodal video understanding can be achieved by grounding generation in dynamically generated semantic anchors.
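
As a rough illustration of what budgeted frame selection can look like, the sketch below keeps the frames that differ most from their predecessor, capped at about 5% of the video. It is not the PRISM implementation: the abstract does not specify the filtering criterion, so the mean-absolute-difference score, the names `mean_abs_diff` and `select_keyframes`, and the `budget_percent` parameter are all assumptions made for this example. The label-guided grouping step is not covered here.

```python
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical: a frame as a flat list of grayscale intensities


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], budget_percent: int = 5) -> List[int]:
    """Return sorted indices of the frames with the largest visual change.

    Assumed heuristic (not taken from the paper): score each frame by its
    difference from the previous frame, always keep the first frame, and
    retain at most `budget_percent` percent of all frames (at least one).
    """
    if not frames:
        return []
    # Integer arithmetic keeps the budget deterministic.
    budget = max(1, len(frames) * budget_percent // 100)
    # The first frame has no predecessor, so it gets an infinite score.
    scores = [float("inf")] + [
        mean_abs_diff(frames[i], frames[i - 1]) for i in range(1, len(frames))
    ]
    # Rank by score, highest first; ties keep their original temporal order.
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in temporal order.
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Synthetic 100-frame "video" whose content changes every 20 frames.
    video = [[float((i // 20) * 50)] * 4 for i in range(100)]
    print(select_keyframes(video))
```

In a pipeline of the kind the abstract describes, only the selected indices would be passed on to the more expensive vision-language stages, which is where the savings from selective inference would come from.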

Submission Number: 21
