Keywords: Video Generative Model, Dataset, Benchmark
TL;DR: We propose a framework with State-Guided Sampling, which uses 'start' and 'end' state images to generate a synthetic dataset of physical interactions; fine-tuning a base model on this dataset enhances its ability to produce plausible state transitions.
Abstract: While recent video generative models can synthesize high-fidelity videos, they struggle to portray plausible physical interactions and the resulting state transitions, a critical bottleneck for applications in robotics and VR/AR. To address this, we introduce a framework to generate a scalable synthetic dataset of controllable interactions. Our pipeline leverages a structured taxonomy and state-of-the-art image editing models to create explicit 'start' and 'end' state images, which serve as visual anchors for the interaction. To generate a seamless video utilizing these anchors, we propose State-Guided Sampling (SGS), a novel sampling technique that mitigates artifacts common in naive conditional generation. Furthermore, we develop and validate a new automated evaluation system that aligns with human judgments to ensure data quality. Experiments show that fine-tuning a base model on our dataset significantly enhances its ability to generate plausible interactions. The dataset, code, and evaluation tools will be released.
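The abstract does not spell out the SGS procedure, but one common way to anchor a diffusion sample on fixed endpoint frames is replacement-based conditioning. The sketch below is a hypothetical illustration of that idea only, not the authors' method; `state_guided_sample`, the `denoiser` interface, and the linear noise schedule are all assumptions made for the example. It shows how the first and last frames of a video latent could be pinned to the 'start' and 'end' anchor images at every sampling step.

```python
# Hypothetical sketch of replacement-based "state-guided" sampling (not the
# paper's SGS algorithm): at each step, the first and last frames of the noisy
# video are overwritten with noised copies of the 'start' and 'end' anchors,
# so the trajectory stays pinned to both endpoints while intermediate frames
# are generated freely.
import numpy as np

def state_guided_sample(denoiser, start_img, end_img, num_frames=16, steps=50, rng=None):
    """denoiser(x, t) should return a slightly less noisy video given x at noise level t."""
    rng = rng or np.random.default_rng(0)
    shape = (num_frames, *start_img.shape)          # (T, H, W, C) video tensor
    x = rng.standard_normal(shape)                  # start from pure noise
    for t in np.linspace(1.0, 0.0, steps):
        # Pin the endpoint frames to the anchors at the current noise level
        # (simple linear schedule, for illustration only).
        x[0]  = t * rng.standard_normal(start_img.shape) + (1 - t) * start_img
        x[-1] = t * rng.standard_normal(end_img.shape) + (1 - t) * end_img
        x = denoiser(x, t)                          # one denoising update
    x[0], x[-1] = start_img, end_img                # exact anchors at the end
    return x

# Toy usage with a dummy denoiser that just shrinks values toward zero.
if __name__ == "__main__":
    dummy = lambda x, t: 0.9 * x
    start, end = np.zeros((8, 8, 3)), np.ones((8, 8, 3))
    video = state_guided_sample(dummy, start, end, num_frames=4, steps=10)
    print(video.shape)  # (4, 8, 8, 3)
```

A real implementation would use the trained video diffusion model's own noise schedule and likely additional machinery to avoid the boundary artifacts the abstract attributes to naive conditional generation; this sketch only conveys the anchoring concept.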
Primary Area: generative models
Submission Number: 1425