Keywords: Dense Video Captioning, User-Control, Synthetic Data Generation
TL;DR: We introduce a new dataset and framework for user-controllable dense video captioning. Our work allows users to dynamically control event density and caption depth, overcoming the fixed, single-style captions of existing benchmarks.
Abstract: Dense video captioning (DVC) aims to generate temporally localized captions for multiple events in untrimmed videos. Despite recent advances, existing methods still produce captions in a single fixed style, because current benchmarks provide only single-style annotations and techniques for handling variations in event granularity and caption specificity remain unexplored. To address this gap, we present User-Controllable Captions (UC Captions), a new dataset with annotations that vary in event density (i.e., how frequently events are detected) and caption depth (i.e., the level of descriptive detail for a given event). This dataset is the first in DVC to explicitly encode controllable dimensions of annotation, establishing a foundation for studying user-driven flexibility. Building on this, we propose User-Controllable DVC (UC-DVC), a framework that incorporates user-defined density and depth parameters to dynamically adjust event localization and caption generation. Extensive experiments demonstrate that UC-DVC flexibly adapts to diverse user requirements while maintaining competitive performance on standard benchmarks. To support further research, both the UC Captions dataset and the UC-DVC code will be publicly released after the review period.
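To make the notion of user control concrete, the sketch below illustrates one possible way density and depth parameters could be exposed at inference time. This is a hypothetical interface for illustration only, not the authors' released API; all names (`CaptionControl`, `caption_video`, the value ranges for `density` and `depth`) are assumptions.

```python
# Hypothetical sketch of a user-controllable DVC interface.
# Not the authors' implementation; names and ranges are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionControl:
    density: float  # assumed 0.0 (only salient events) to 1.0 (fine-grained event segmentation)
    depth: int      # assumed 1 (brief captions) to 3 (highly detailed descriptions)


@dataclass
class DenseCaption:
    start: float    # event start time in seconds
    end: float      # event end time in seconds
    text: str       # generated caption for the event


def caption_video(video_path: str, control: CaptionControl) -> List[DenseCaption]:
    """Placeholder for a controllable dense video captioning model.

    A real system would condition event-proposal thresholds on `control.density`
    and the caption decoder's verbosity on `control.depth`.
    """
    raise NotImplementedError("Model and code are to be released after review.")


# Example requests: few events with detailed captions vs. many events with brief ones.
sparse_detailed = CaptionControl(density=0.2, depth=3)
dense_brief = CaptionControl(density=0.9, depth=1)
```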
Primary Area: datasets and benchmarks
Submission Number: 7294