Text-image to video generation aims to synthesize a video conditioned on given text-image inputs. Existing methods, however, generally assume that the semantic information carried by the input text and image is perfectly paired and temporally aligned, so that both occur simultaneously in the generated video. As a result, they struggle with ``unpaired'' text-image inputs, a more general and realistic scenario in which i) the semantic information carried by the text and the image may occur at different timestamps and ii) the condition image can appear at an arbitrary position in the synthesized video rather than only as the first frame. Video generation in this unpaired setting requires reasoning over the intrinsic connections between the textual description and the reference image, which is challenging and remains unexplored. In this paper we study the problem of unpaired text-image to video generation for the first time and propose ReasonDiff, a novel model for accurate video generation from unpaired text-image inputs. Specifically, ReasonDiff introduces a VisionNarrator module that harnesses the reasoning abilities of a multi-modal large language model to analyze the unpaired text-image inputs and produce coherent per-frame narratives that temporally align them. Building on the VisionNarrator module, ReasonDiff further introduces an AlignFormer module, which employs a Multi-stage Temporal Anchor Attention mechanism to predict frame-wise latent representations. These reasoning-enhanced latents are then fused with the condition frame, providing structured guidance throughout the video generation process. Extensive experiments and ablation studies demonstrate that ReasonDiff significantly outperforms state-of-the-art baselines in video generation quality with unpaired text-image inputs. Generated video samples are available at \url{https://reasondiff.github.io/}.

Overview of the proposed ReasonDiff model, which consists of two key components: (1) the MLLM-Driven Multi-frame Reasoner, and (2) the Reasoning-Guided Generative Model. The generative model operates under the guidance of the multi-modal reasoning results generated by the reasoner.
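
The data flow described above can be illustrated with a toy sketch. It is not the ReasonDiff implementation: the names `FramePlan`, `plan_frames`, and `fuse_with_condition` are hypothetical, the per-frame narratives are supplied by hand rather than produced by a multi-modal large language model, and fusion is assumed here to be a simple replacement of the latent at the condition frame's position, which the source does not specify.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class FramePlan:
    """One frame slot of the video to be generated (hypothetical structure)."""
    index: int
    narrative: str            # per-frame narrative, e.g. from a reasoning module
    is_condition_frame: bool  # True for the slot where the condition image sits


def plan_frames(narratives: Sequence[str], anchor_index: int) -> List[FramePlan]:
    """Attach narratives to frame slots and mark the condition-image slot.

    In the unpaired setting the anchor may be any frame, not only frame 0.
    """
    if not 0 <= anchor_index < len(narratives):
        raise ValueError("anchor_index must be a valid frame index")
    return [
        FramePlan(index=i, narrative=text, is_condition_frame=(i == anchor_index))
        for i, text in enumerate(narratives)
    ]


def fuse_with_condition(
    predicted_latents: Sequence[Sequence[float]],
    condition_latent: Sequence[float],
    anchor_index: int,
) -> List[List[float]]:
    """Return a copy of the predicted latents with the anchor slot replaced.

    ASSUMPTION: fusion is modeled as plain replacement for illustration only.
    """
    if not 0 <= anchor_index < len(predicted_latents):
        raise ValueError("anchor_index must be a valid frame index")
    fused = [list(latent) for latent in predicted_latents]
    fused[anchor_index] = list(condition_latent)
    return fused


# Hand-written narratives; the condition image is placed at the last frame.
narratives = [
    "A person reaches for the window.",
    "The window swings open.",
    "Wind enters and the candle goes out.",
]
plan = plan_frames(narratives, anchor_index=2)
latents = fuse_with_condition([[0.0, 0.0]] * 3, [1.0, 1.0], anchor_index=2)
```

The point of the sketch is only that the condition frame is one anchored slot among the per-frame guidance signals, while the remaining slots are filled by predicted representations.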

| Input text | Reasoned outcome |
|---|---|
| "A man opens the window." | (The wind will blow the pages) |
| "A dog running on the road." | (The dog will run past the bicycle) |
| "A cat plays in the room." | (The table leg is broken by the cat) |
| "A soccer flying." | (The window is broken by the soccer) |
| "A cat plays in the room." | (The vase is broken by the cat) |
| "A car running on the road." | (The car will run past the dog) |
| "A person opens the window." | (The wind will blow the candle out) |
| "A man whistles." | (The dog runs happily to the man.) |

Qualitative comparison (images omitted): for each condition image and text prompt, the ReasonDiff result is shown alongside Dynamicrafter, LTX-Video, CogVideoX, and Wan2.1. Prompts: "A person opens the window.", "A cat plays in the room.", "A soccer flying.", "A dog running on the road."


| Models | Imaging Quality | Motion Smoothness | Dynamic Degree | CLIP-Text | CLIP-Image | User Rank |
|---|---|---|---|---|---|---|
| Dynamicrafter | 0.492(0.111) | 0.979(0.019) | 0.484(0.499) | 0.202(0.057) | 0.508(0.087) | 2.871(1.239) |
| LTX-Video | 0.398(0.081) | 0.977(0.008) | 0.734(0.442) | 0.211(0.051) | 0.544(0.084) | 3.307(1.259) |
| CogVideoX | 0.507(0.086) | 0.949(0.023) | 0.872(0.089) | 0.197(0.039) | 0.537(0.078) | 4.384(1.041) |
| Wan2.1 | 0.512(0.103) | 0.980(0.023) | 0.810(0.280) | 0.224(0.056) | 0.518(0.079) | 2.692(1.079) |
| ReasonDiff | 0.528(0.106) | 0.986(0.048) | 0.936(0.244) | 0.261(0.061) | 0.528(0.082) | 1.743(1.044) |

| Models | Imaging Quality | Motion Smoothness | Dynamic Degree | CLIP-Text | CLIP-Image | User Rank |
|---|---|---|---|---|---|---|
| Dynamicrafter | 0.517(0.123) | 0.984(0.017) | 0.440(0.496) | 0.201(0.043) | 0.526(0.091) | 3.179(1.189) |
| LTX-Video | 0.406(0.087) | 0.986(0.010) | 0.695(0.460) | 0.206(0.037) | 0.588(0.078) | 4.051(0.971) |
| CogVideoX | 0.552(0.082) | 0.970(0.015) | 0.688(0.323) | 0.177(0.025) | 0.572(0.059) | 3.256(1.481) |
| Wan2.1 | 0.560(0.111) | 0.962(0.023) | 0.665(0.183) | 0.191(0.036) | 0.552(0.075) | 2.743(1.140) |
| ReasonDiff | 0.571(0.109) | 0.984(0.028) | 0.673(0.470) | 0.214(0.044) | 0.572(0.092) | 1.769(1.245) |
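
To make the margins in these tables easier to read, the following minimal sketch parses cells of the form `value(spread)` and computes the relative gain of ReasonDiff over the strongest baseline on one metric. It assumes that the first number in each cell is the reported score and the parenthesized number is a spread such as a standard deviation (the source does not state this), and the helper names `parse_cell` and `relative_gain` are hypothetical. The CLIP-Text cells are copied from the first table.

```python
import re
from typing import Tuple

# Matches cells such as "0.261(0.061)": a score followed by a parenthesized spread.
CELL = re.compile(r"^\s*([0-9.]+)\(([0-9.]+)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float]:
    """Split a 'value(spread)' table cell into two floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def relative_gain(ours: str, baseline: str) -> float:
    """Relative difference of the scores, ignoring the spread."""
    ours_value, _ = parse_cell(ours)
    base_value, _ = parse_cell(baseline)
    return (ours_value - base_value) / base_value


# CLIP-Text column of the first table.
clip_text = {
    "Dynamicrafter": "0.202(0.057)",
    "LTX-Video": "0.211(0.051)",
    "CogVideoX": "0.197(0.039)",
    "Wan2.1": "0.224(0.056)",
    "ReasonDiff": "0.261(0.061)",
}

baselines = {name: cell for name, cell in clip_text.items() if name != "ReasonDiff"}
best_baseline = max(baselines, key=lambda name: parse_cell(baselines[name])[0])
gain = relative_gain(clip_text["ReasonDiff"], clip_text[best_baseline])
print(f"ReasonDiff vs {best_baseline} on CLIP-Text: {gain:.1%}")
```

The same helpers apply to the other columns; for User Rank a lower value appears to be better, so `min` rather than `max` would select the strongest baseline there.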