Keywords: Dense Video Captioning, User-Control, Synthetic Data Generation
TL;DR: We introduce a new dataset and framework for user-controllable dense video captioning. Our work allows users to dynamically control event density and caption depth, overcoming the fixed, single-style captions of existing benchmarks.
Abstract: Dense video captioning (DVC) aims to generate temporally localized captions for multiple events in untrimmed videos. Despite recent advances, existing methods still produce captions in a single fixed style, because current benchmarks provide only single-style annotations and techniques for handling variations in event granularity and caption specificity remain unexplored. To address this gap, we present User-Controllable Captions (UC Captions), a new dataset with annotations that vary in event density (i.e., how frequently events are detected) and caption depth (i.e., the level of descriptive detail for a given event). This dataset is the first in DVC to explicitly encode controllable dimensions of annotation, establishing a foundation for studying user-driven flexibility. Building on this, we propose User-Controllable DVC (UC-DVC), a framework that incorporates user-defined density and depth parameters to dynamically adjust event localization and caption generation. Extensive experiments demonstrate that UC-DVC flexibly adapts to diverse user requirements while maintaining competitive performance on standard benchmarks. To support further research, both the UC Captions dataset and the UC-DVC code will be publicly released after the review period.
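To make the notion of user control concrete, the sketch below illustrates one possible way density and depth parameters could be exposed at inference time. This is a hypothetical interface for illustration only, not the authors' released API; all names (`CaptionControl`, `caption_video`, the value ranges for `density` and `depth`) are assumptions.

```python
# Hypothetical sketch of a user-controllable DVC interface.
# Not the authors' implementation; names and ranges are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionControl:
    density: float  # assumed 0.0 (only salient events) to 1.0 (fine-grained event segmentation)
    depth: int      # assumed 1 (brief captions) to 3 (highly detailed descriptions)


@dataclass
class DenseCaption:
    start: float    # event start time in seconds
    end: float      # event end time in seconds
    text: str       # generated caption for the event


def caption_video(video_path: str, control: CaptionControl) -> List[DenseCaption]:
    """Placeholder for a controllable dense video captioning model.

    A real system would condition event-proposal thresholds on `control.density`
    and the caption decoder's verbosity on `control.depth`.
    """
    raise NotImplementedError("Model and code are to be released after review.")


# Example requests: few events with detailed captions vs. many events with brief ones.
sparse_detailed = CaptionControl(density=0.2, depth=3)
dense_brief = CaptionControl(density=0.9, depth=1)
```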
Primary Area: datasets and benchmarks
Submission Number: 7294