TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, Readers: Everyone, License: CC BY 4.0
Abstract: This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code are publicly available at https://github.com/yaolinli/TimeChat-Captioner.
Lay Summary: When we watch a movie, our minds effortlessly weave together what we see, what we hear, and how scenes unfold over time. But today's AI systems still struggle to describe videos with the same richness: most summarize the whole clip in a single sentence, capture only what is visible while ignoring sound, or lose track of when things happen. We propose a new task that asks AI to describe a video the way a screenwriter would write a script: moment by moment, covering both visual and audio details, with precise timestamps for every scene. To guide this, we designed a six-part structural template, covering elements such as setting, actions, speech, and sound, so that anyone reading the description can vividly picture the video without seeing it. We also built a human-verified evaluation benchmark, a large training dataset, and a strong open-source model trained on it. The richer, time-aligned video descriptions our system produces serve two purposes. First, they offer high-quality training material that can teach other multimodal AI models to understand videos more deeply. Second, our freely available model reaches the video-captioning quality of today's leading commercial systems (e.g., Google's Gemini-2.5-Pro), while remaining fully open and free for researchers, educators, and the broader community to use, study, and build upon.
Link To Code: https://github.com/yaolinli/TimeChat-Captioner
Primary Area: Deep Learning->Large Language Models
Keywords: Omni-Video Understanding, Audio-Visual Captioning, Structural Video Captioning, Detailed Video Captioning
Submission Number: 4175