Keywords: Video Generative Model, Dataset, Benchmark
TL;DR: We propose a framework with State-Guided Sampling, which uses 'start' and 'end' state images to generate a synthetic dataset of physical interactions; fine-tuning a base model on this dataset enhances its ability to produce plausible state transitions.
Abstract: While recent video generative models can synthesize high-fidelity videos, they struggle to portray plausible physical interactions and the resulting state transitions, a critical bottleneck for applications in robotics and VR/AR. To address this, we introduce a framework to generate a scalable synthetic dataset of controllable interactions. Our pipeline leverages a structured taxonomy and state-of-the-art image editing models to create explicit 'start' and 'end' state images, which serve as visual anchors for the interaction. To generate a seamless video utilizing these anchors, we propose State-Guided Sampling (SGS), a novel sampling technique that mitigates artifacts common in naive conditional generation. Furthermore, we develop and validate a new automated evaluation system that aligns with human judgments to ensure data quality. Experiments show that fine-tuning a base model on our dataset significantly enhances its ability to generate plausible interactions. The dataset, code, and evaluation tools will be released.
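The abstract does not spell out the SGS procedure, but one common way to anchor a diffusion sample on fixed endpoint frames is replacement-based conditioning. The sketch below is a hypothetical illustration of that idea only, not the authors' method; `state_guided_sample`, the `denoiser` interface, and the linear noise schedule are all assumptions made for the example. It shows how the first and last frames of a video latent could be pinned to the 'start' and 'end' anchor images at every sampling step.

```python
# Hypothetical sketch of replacement-based "state-guided" sampling (not the
# paper's SGS algorithm): at each step, the first and last frames of the noisy
# video are overwritten with noised copies of the 'start' and 'end' anchors,
# so the trajectory stays pinned to both endpoints while intermediate frames
# are generated freely.
import numpy as np

def state_guided_sample(denoiser, start_img, end_img, num_frames=16, steps=50, rng=None):
    """denoiser(x, t) should return a slightly less noisy video given x at noise level t."""
    rng = rng or np.random.default_rng(0)
    shape = (num_frames, *start_img.shape)          # (T, H, W, C) video tensor
    x = rng.standard_normal(shape)                  # start from pure noise
    for t in np.linspace(1.0, 0.0, steps):
        # Pin the endpoint frames to the anchors at the current noise level
        # (simple linear schedule, for illustration only).
        x[0]  = t * rng.standard_normal(start_img.shape) + (1 - t) * start_img
        x[-1] = t * rng.standard_normal(end_img.shape) + (1 - t) * end_img
        x = denoiser(x, t)                          # one denoising update
    x[0], x[-1] = start_img, end_img                # exact anchors at the end
    return x

# Toy usage with a dummy denoiser that just shrinks values toward zero.
if __name__ == "__main__":
    dummy = lambda x, t: 0.9 * x
    start, end = np.zeros((8, 8, 3)), np.ones((8, 8, 3))
    video = state_guided_sample(dummy, start, end, num_frames=4, steps=10)
    print(video.shape)  # (4, 8, 8, 3)
```

A real implementation would use the trained video diffusion model's own noise schedule and likely additional machinery to avoid the boundary artifacts the abstract attributes to naive conditional generation; this sketch only conveys the anchoring concept.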
Primary Area: generative models
Submission Number: 1425