TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Published: 18 Sept 2025, Last Modified: 30 Oct 2025
Venue: NeurIPS 2025 Datasets and Benchmarks Track (poster)
License: CC BY 4.0
Keywords: Human Video Dataset, Multi-shot Video Generation, Speech Video Generation, Controllable Video Synthesis, Camera Shot Planning, Multimodal Generation
TL;DR: We introduce TalkCuts, a large-scale dataset for multi-shot speech video generation, and demonstrate its utility through a simple LLM-guided generation baseline.
Abstract: In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality 1080p human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset covers over 10k identities and includes detailed textual descriptions, 2D keypoints, and 3D SMPL-X motion annotations, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multimodal generation framework that serves as a simple baseline. In Orator, the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multimodal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.
Croissant File: json
Dataset URL: https://kaggle.com/datasets/f6e549a12ebd5ee185dc27247602d6e3828b772a68bae1f080587a6b84fafbbd
Code URL: https://github.com/UMass-Embodied-AGI/TalkCuts
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in computer vision
Submission Number: 1091