Keywords: Video Generation, Image Generation, Diffusion Model
Abstract: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from an image-to-video generation setting, users can control objects in the image so that they naturally leave the scene, or introduce entirely new identity references that enter the scene, guided by a user-specified motion trajectory. To support this task, we introduce a semi-automatically curated dataset, an efficient identity-preserving, motion-controllable video Diffusion Transformer architecture, and a comprehensive evaluation protocol targeting this task. Our evaluation shows that the proposed approach significantly outperforms existing baselines.
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 4402