Reasoning Diffusion for Unpaired Text-Image to Video Generation

ICLR 2026 Conference Submission 8868 Authors

17 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Readers: Everyone · License: CC BY 4.0
Keywords: Video Generation, Unpaired Text-Image Inputs, Multi-modal Large Language Model
TL;DR: This paper addresses text-image to video generation where the given image and text inputs are unpaired.
Abstract: Text-image to video generation aims to synthesize a video conditioned on given text-image inputs. However, existing methods generally assume that the semantic information carried by the input text and image is perfectly paired and temporally aligned, occurring simultaneously in the generated video. Consequently, existing approaches struggle with "unpaired" text-image inputs in the more universal and realistic scenario where i) the semantics conveyed by the text and image may occur at different timestamps, and ii) the condition image can appear at an arbitrary position rather than as the first frame of the synthesized video. Video generation under this unpaired setting requires reasoning over the intrinsic connections between the given textual description and the referred image, which is challenging and remains unexplored. To address this challenge, we study the problem of unpaired text-image to video generation for the first time and propose ReasonDiff, a novel model for accurate video generation from unpaired text-image inputs. Specifically, ReasonDiff introduces a VisionNarrator module that harnesses the reasoning abilities of a multi-modal large language model to analyze the unpaired text-image conditions, producing coherent per-frame narratives that temporally align them. Building upon the VisionNarrator module, ReasonDiff further introduces a novel AlignFormer module, which employs a Multi-stage Temporal Anchor Attention mechanism to predict frame-wise latent representations. These reasoning-enhanced latents are subsequently fused with the condition frame, providing structured guidance throughout the video generation process. Extensive experiments and ablation studies demonstrate that ReasonDiff significantly outperforms state-of-the-art baselines in video generation quality with unpaired text-image inputs. Generated video samples are available at https://reasondiff.github.io/.
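To make the described pipeline more concrete, below is a minimal, hypothetical PyTorch sketch of how frame-wise latents might be predicted from per-frame narrative embeddings and a condition-frame latent via stacked temporal-anchor cross-attention. The abstract does not specify the actual architecture, so all class names (TemporalAnchorAttention, AlignFormerSketch), tensor shapes, and the stage structure here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TemporalAnchorAttention(nn.Module):
    """One assumed stage: cross-attend per-frame query latents to (i) the
    MLLM's per-frame narrative embeddings and (ii) the condition-frame
    latent, which acts as a temporal anchor at an arbitrary frame index."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.narrative_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.anchor_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_queries, narrative_emb, anchor_latent):
        # frame_queries: (B, T, D) per-frame query latents
        # narrative_emb: (B, T, D) embeddings of the per-frame narratives
        # anchor_latent: (B, 1, D) latent of the condition image
        x, _ = self.narrative_attn(self.norm1(frame_queries), narrative_emb, narrative_emb)
        frame_queries = frame_queries + x
        x, _ = self.anchor_attn(self.norm2(frame_queries), anchor_latent, anchor_latent)
        return frame_queries + x


class AlignFormerSketch(nn.Module):
    """Stacks several temporal-anchor attention stages to predict frame-wise
    latents that could condition a video diffusion backbone (hypothetical)."""

    def __init__(self, dim: int = 512, num_frames: int = 16, num_stages: int = 3):
        super().__init__()
        self.frame_queries = nn.Parameter(torch.randn(1, num_frames, dim))
        self.stages = nn.ModuleList(TemporalAnchorAttention(dim) for _ in range(num_stages))

    def forward(self, narrative_emb, anchor_latent):
        q = self.frame_queries.expand(narrative_emb.size(0), -1, -1)
        for stage in self.stages:
            q = stage(q, narrative_emb, anchor_latent)
        return q  # (B, T, D) reasoning-enhanced frame latents


if __name__ == "__main__":
    B, T, D = 2, 16, 512
    narratives = torch.randn(B, T, D)   # stand-in for MLLM narrative embeddings
    anchor = torch.randn(B, 1, D)       # stand-in for the condition-image latent
    frame_latents = AlignFormerSketch(dim=D, num_frames=T)(narratives, anchor)
    print(frame_latents.shape)          # torch.Size([2, 16, 512])
```

In this sketch the condition image enters only through cross-attention, so it can anchor any frame position; how the actual model fuses the resulting latents with the condition frame inside the diffusion backbone is not described in the abstract.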
Supplementary Material: zip
Primary Area: generative models
Submission Number: 8868