Towards Human-Like Event Boundary Detection in Unstructured Videos through Scene-Action Transition

ICLR 2026 Conference Submission17479 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: event boundary detection, video understanding, unsupervised learning, temporal modeling, egocentric vision
Abstract: Event segmentation research in psychology shows that humans naturally parse continuous activity into meaningful episodes by detecting boundaries marked by changes in perceptual features (e.g., motion) and conceptual features (e.g., goals, context). These boundaries structure episodic memory, enabling recall and prediction. Motivated by this cognitive process, we address the problem of segmenting long, unstructured video into semantically coherent episodes suitable for autonomous agents. While Generic Event Boundary Detection (GEBD) has been applied to video summarization, action localization, and surveillance, existing methods largely emphasize fine-grained, motion-driven boundaries. Such approaches struggle in real-world settings where an embodied agent must continuously structure its sensory stream: short, fragmented boundaries lead to unstable and incoherent episodic memory. We propose a zero-shot, unsupervised event segmentation framework designed for streaming, real-time perception. A key innovation is a backward-looking temporal windowing mechanism, inspired by the structure of human episodic memory, which compares the present to the recent past to determine when an event has ended. This avoids reliance on unavailable future frames and reduces false boundaries caused by intra-scene motion (e.g., camera panning). At a second level, we integrate perceptual changes with scene graphs, audio cues, and caption semantics to group low-level transitions into coherent episodes. Experiments on long-form egocentric video (Ego4D-MQ) and ADL scenarios demonstrate that our method aligns well with human-queried moments, outperforming post-processing strategies and motion-dominant GEBD baselines. By prioritizing semantic coherence over superficial discontinuities, our approach provides a scalable foundation for episodic memory in cognitive agents, bridging insights from human perception with machine video understanding.
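The backward-looking windowing idea described in the abstract can be illustrated with a minimal sketch: compare the current frame's feature vector against the mean of a sliding window over recent past frames, and flag a boundary when the divergence exceeds a threshold. All names, the distance measure, and the threshold below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of backward-looking event boundary detection.
# Window size, threshold, and cosine distance are illustrative choices.
from collections import deque
import math


def cosine_distance(a, b):
    """1 minus cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


class BackwardWindowDetector:
    """Flags an event boundary when the current frame's features diverge
    from the mean of a window over the recent past (no future frames used)."""

    def __init__(self, window_size=8, threshold=0.5):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def step(self, feat):
        # Not enough history yet: cannot declare a boundary.
        if len(self.window) < self.window.maxlen:
            self.window.append(feat)
            return False
        # Mean feature of the recent-past window.
        mean = [sum(col) / len(self.window) for col in zip(*self.window)]
        is_boundary = cosine_distance(feat, mean) > self.threshold
        if is_boundary:
            # Start a new event: the past no longer describes the present.
            self.window.clear()
        self.window.append(feat)
        return is_boundary
```

Because only past frames are consulted, the detector runs causally on a stream, matching the real-time constraint the abstract describes.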
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 17479