What Are You Doing? A Closer Look at Controllable Human Video Generation

Emanuele Bugliarello; Anurag Arnab; Roni Paiss; Christy Koh; Pieter-Jan Kindermans; Cordelia Schmid

16 Sept 2025 (modified: 14 Nov 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: benchmark, dataset, evaluation, video generation, controllable video generation, human video generation

TL;DR: A new benchmark to measure progress in controllable human video generation.

Abstract: High-quality benchmarks are crucial for driving progress in machine learning research. However, despite the growing interest in video generation, there is no comprehensive dataset to evaluate human synthesis. Humans can perform a wide variety of actions and interactions, but existing datasets, like TikTok and TED-Talks, lack the diversity and complexity to fully capture the capabilities of video generation models. We close this gap by introducing 'What Are You Doing?' (WYD): a new benchmark for fine-grained evaluation of controllable image-to-video generation of humans. WYD consists of 1,544 captioned videos that have been meticulously collected and annotated with fine-grained categories. These allow us to systematically measure performance across 9 aspects of human generation, including actions, interactions and motion. We also propose and validate an evaluation framework that leverages our annotations and reflects human preferences well. Equipped with our dataset and metrics, we perform in-depth analyses of state-of-the-art open-source models in controllable image-to-video generation, showing how WYD provides novel insights about their capabilities. We release our data and code to drive forward progress in human video generation.

Primary Area: datasets and benchmarks

Submission Number: 7795
