Behaviour-Aware Multimodal Video Summarization: Cross-Modal Integration for Human-Centric Content Analysis
Keywords: multimodal video summarization, behavioural cues detection, computer vision, natural language processing, audio and speech processing, multimedia content creation, human-centric videos
TL;DR: We combine traditional behavioural cue detection with transformer-based multimodal fusion to create video summaries that capture human communicative intent across visual, audio, and textual modalities.
Abstract: Video summarization remains challenging because it must capture the complex interplay of visual dynamics, spoken content, and behavioural cues that collectively shape viewer understanding of human-centric videos. Human communication is inherently multimodal, yet existing video summarization approaches rely either solely on visual features or on rudimentary text-visual combinations, neglecting critical audio prosodic patterns and their interactions. Crucially, the synchronized behavioural signals that convey emotional expression and communicative intent are largely overlooked. In this paper, we present a behaviour-aware multimodal framework for video summarization that explicitly models synchronized behavioural cues across visual, audio, and textual modalities through a transformer-based architecture with cross-modal attention mechanisms. Our approach integrates CLIP visual embeddings enhanced with facial movement detection and emotional transitions, HuBERT audio features enriched with prosodic patterns (including pitch variations and voice quality measures), and RoBERTa textual embeddings that preserve narrative flow and discourse structure. We employ heuristic behavioural cue detection combined with large language model-guided extractive summarization to generate pseudo-ground-truth references that capture both semantic importance and behavioural salience. Extensive evaluation on the ChaLearn First Impressions dataset demonstrates substantial improvements over state-of-the-art methods, with a 33.2% increase in F1-score over CLIP-It and 7.3% over recent multimodal approaches. Comprehensive ablation studies confirm the effectiveness of behavioural cue integration, with each modality contributing complementary information for capturing communicatively significant moments in interview-style videos.
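To make the heuristic prosodic-cue detection mentioned in the abstract concrete, below is a minimal sketch of pitch-based cue flagging, assuming librosa for pitch tracking. The function name `prosodic_cues`, the 16 kHz sample rate, and the `pitch_std_thresh` threshold are illustrative assumptions for exposition, not the paper's actual implementation or values.

```python
# Minimal sketch of heuristic prosodic-cue extraction for one audio segment.
# Assumption: high pitch variation marks a behaviourally salient segment;
# the threshold below is an illustrative stand-in, not the paper's value.
import numpy as np
import librosa


def prosodic_cues(wav_path: str, pitch_std_thresh: float = 30.0) -> dict:
    """Extract simple pitch statistics and flag high-variation segments."""
    y, sr = librosa.load(wav_path, sr=16000)
    # pyin returns frame-level f0 (NaN on unvoiced frames) plus voicing flags.
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
    )
    f0_voiced = f0[~np.isnan(f0)]  # keep voiced frames only
    if f0_voiced.size == 0:
        return {"pitch_mean": 0.0, "pitch_std": 0.0, "salient": False}
    pitch_std = float(np.std(f0_voiced))
    return {
        "pitch_mean": float(np.mean(f0_voiced)),
        "pitch_std": pitch_std,
        "salient": pitch_std > pitch_std_thresh,  # heuristic behavioural cue
    }
```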
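The cross-modal attention fusion can likewise be sketched in PyTorch. This is a minimal illustration assuming per-segment feature sequences with common embedding sizes (512 for CLIP ViT-B/32, 768 for HuBERT and RoBERTa base); the class names, the shared 256-dimensional space, and the segment scorer are assumptions for exposition, not the authors' architecture.

```python
# Minimal sketch of cross-modal attention fusion over per-segment features.
# Dimensions, layer counts, and class names are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality (query) attends over another modality (key/value)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)  # residual connection, then normalize


class BehaviourAwareFusion(nn.Module):
    """Projects CLIP / HuBERT / RoBERTa features into a shared space,
    lets the visual stream attend to audio and text, and scores each
    video segment for inclusion in the summary."""

    def __init__(self, d_vis: int = 512, d_aud: int = 768,
                 d_txt: int = 768, dim: int = 256):
        super().__init__()
        self.proj_v = nn.Linear(d_vis, dim)  # CLIP visual embeddings
        self.proj_a = nn.Linear(d_aud, dim)  # HuBERT audio features
        self.proj_t = nn.Linear(d_txt, dim)  # RoBERTa text embeddings
        self.v_from_a = CrossModalAttention(dim)
        self.v_from_t = CrossModalAttention(dim)
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, vis, aud, txt):
        v, a, t = self.proj_v(vis), self.proj_a(aud), self.proj_t(txt)
        v = self.v_from_a(v, a)  # visual stream attends to prosody
        v = self.v_from_t(v, t)  # then to the transcript
        return self.scorer(v).squeeze(-1)  # per-segment importance scores


# Usage with dummy features: a batch of 2 videos, 10 segments each.
model = BehaviourAwareFusion()
scores = model(
    torch.randn(2, 10, 512),  # visual
    torch.randn(2, 10, 768),  # audio
    torch.randn(2, 10, 768),  # text
)
print(scores.shape)  # torch.Size([2, 10])
```

In a sketch like this, the per-segment scores would be thresholded or top-k selected against the pseudo-ground-truth references during training; the actual selection and loss design are not specified in the abstract.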
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19272