How Do You Watch a Movie? HourHDVC: Hour-Long Hierarchical Dense Video Captioning

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Dense Video Captioning, Hour-long Video Understanding, Paragraph Captioning
TL;DR: We present HourHDVC, a benchmark and model for hour-long dense video captioning that leverages scene-to-narrative structure and long-context memory, setting a new standard for coherent, paragraph-level video descriptions.
Abstract: While existing Dense Video Captioning (DVC) research has shown promise on short video clips, current approaches struggle with hour-long videos due to a critical lack of both datasets that capture long-term context and models capable of managing extensive temporal dependencies. To address these challenges, we introduce Hierarchical Dense Video Captioning (HDVC), a novel task designed for long-form videos that involves both scene-level and video-level narrative captioning. For this task, we propose HourHDVC, a new dataset providing comprehensive annotations for hour-long videos. We also present LOng COntext memory-based hierarchical dense video captioning (LOCO), an end-to-end model explicitly designed to manage extensive temporal dependencies by modeling the scene-to-narrative structure inherent in HDVC. LOCO leverages a two-tier memory system, Context-aware Memory and Long-term Context Memory, to maintain narrative coherence across extended durations. Experiments on HourHDVC demonstrate that LOCO establishes strong baselines for the HDVC task, while highlighting the remaining challenges of modeling long-form video narratives.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 10868