TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Linli Yao; Yuancheng Wei; Yaojie Zhang; Lei Li; Xinlong Chen; Feifan Song; Ziyue Wang; Kun Ouyang; Yuanxin Liu; Lingpeng Kong; Qi Liu; Pengfei Wan; Kun Gai; Yuanxing Zhang; Xu Sun

Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 regular

Readers: Everyone

License: CC BY 4.0

Abstract: This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code are publicly available at https://github.com/yaolinli/TimeChat-Captioner.

Lay Summary: When we watch a movie, our minds effortlessly weave together what we see, what we hear, and how scenes unfold over time. Today's AI systems still struggle to describe videos with the same richness: most summarize the whole clip in a single sentence, capture only what is visible while ignoring sound, or lose track of when things happen. We propose a new task that asks AI to describe a video the way a screenwriter would write a script: moment by moment, covering both visual and audio details, with precise timestamps for every scene. To guide this, we designed a six-part structural template, covering elements such as setting, actions, speech, and sound, so that anyone reading the description can vividly picture the video without seeing it. We also built a human-verified evaluation benchmark, a large training dataset, and a strong open-source model trained on it. The richer, time-aligned video descriptions our system produces serve two purposes. First, they offer high-quality training material that can teach other multimodal AI models to understand videos more deeply. Second, our freely available model reaches the video-captioning quality of today's leading commercial systems, such as Google's Gemini-2.5-Pro, while remaining fully open and free for researchers, educators, and the broader community to use, study, and build upon.
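To make the idea of a timestamped, structured scene description more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the released data format: the class name `SceneCaption`, its field names, and the example text are all hypothetical, and only the four template elements named above are used because the remaining two are not listed here. The helper computes the standard temporal intersection-over-union between two time spans, a common way to compare predicted and reference scene boundaries; it is not the SodaM metric.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SceneCaption:
    """One timestamped scene of a script-like caption (hypothetical layout)."""
    start: float  # scene start time, in seconds
    end: float    # scene end time, in seconds
    # Maps a template element to its description. The keys used below are the
    # four elements named in the summary; the other two are not specified here.
    elements: Dict[str, str] = field(default_factory=dict)


def temporal_iou(a: SceneCaption, b: SceneCaption) -> float:
    """Standard temporal intersection-over-union of two time spans."""
    intersection = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - intersection
    return intersection / union if union > 0 else 0.0


# Invented example values, for illustration only.
predicted = SceneCaption(
    start=0.0,
    end=10.0,
    elements={
        "setting": "a small kitchen in daylight",
        "actions": "a person chops vegetables",
        "speech": "no dialogue",
        "sound": "a knife tapping on a cutting board",
    },
)
reference = SceneCaption(start=0.0, end=5.0, elements={"setting": "a kitchen"})

overlap = temporal_iou(predicted, reference)
```

Here the two spans share 5 seconds out of a combined 10, so `overlap` is 0.5; spans that do not overlap at all score 0.0.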

Link To Code: https://github.com/yaolinli/TimeChat-Captioner

Primary Area: Deep Learning->Large Language Models

Keywords: Omni-Video Understanding, Audio-Visual Captioning, Structural Video Captioning, Detailed Video Captioning

Submission Number: 4175
