Keywords: Efficient Multimodal Learning; Video Summarization; Vision-Language Models; Frame Selection; Sustainable AI; Efficient Inference; Procedural Video Understanding; Zero-shot Video Understanding; Semantic Frame Filtering; Long-form Video Processing
Abstract: Video summarization transforms long videos into concise representations that are easier to document and analyze, especially in high-stakes domains such as surgical training. However, existing approaches to long-form video often rely on dense frame processing or supervised training pipelines, which are computationally expensive and can still miss important procedural content. We present PRISM (Procedural Representation via Integrated Semantic and Multimodal Analysis), a zero-shot, training-free framework that summarizes procedural videos efficiently. PRISM selects fewer than 5% of video frames while retaining over 84% of the semantic content and improving over the baselines by up to 7.5%. Rather than exhaustively processing frames, PRISM uses lightweight visual filtering and dynamically generated procedural labels as semantic anchors to select meaningful frame-label groups. This selective inference design preserves key actions, transitions, and contextual details while reducing the number of visual inputs passed to downstream vision-language captioning stages. We evaluate PRISM on YouCook2 and ActivityNet Captions, with additional studies on keyframe selection benchmarks and surgical video datasets. Across procedural and domain-specific video tasks, PRISM achieves strong semantic alignment and precision, suggesting that efficient multimodal video understanding can be achieved by grounding generation in dynamically generated semantic anchors.
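The abstract does not specify how the lightweight visual filtering stage works, so the following is only a minimal sketch of one common way to do it, not PRISM's actual method. It keeps a frame when its mean absolute pixel difference from the last kept frame exceeds a threshold, and stops once a frame budget is reached. The names `mean_abs_diff` and `filter_frames`, the threshold value, and the representation of frames as flat lists of grayscale values are all assumptions made for illustration; only the 5% budget echoes the figure quoted above.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    if not a:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def filter_frames(
    frames: Sequence[Sequence[float]],
    threshold: float = 10.0,      # hypothetical change threshold
    max_fraction: float = 0.05,   # frame budget as a fraction of the video
) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame.

    The first frame is always kept. Selection stops as soon as the budget
    (at least one frame) is filled.
    """
    if not frames:
        return []
    budget = max(1, int(len(frames) * max_fraction))
    kept = [0]
    for i in range(1, len(frames)):
        if len(kept) >= budget:
            break
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy video: three static "scenes" of flat 4-pixel grayscale frames.
    video = [[0.0] * 4] * 40 + [[50.0] * 4] * 30 + [[200.0] * 4] * 30
    print(filter_frames(video))
```

In a pipeline of the kind the abstract describes, the retained indices would then be paired with procedural labels and passed to a vision-language captioning stage. That step is outside the scope of this sketch.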
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 21