ReasonDiff: Reasoning-based Video Generation under Unpaired Text-Image Conditions

Anonymous Authors

Abstract

Text-image-to-video generation aims to synthesize a video conditioned on given text and image inputs. Existing methods, however, generally assume that the semantic information carried by the input text and image is perfectly paired and temporally aligned, so that both occur simultaneously in the generated video. As a result, they struggle with "unpaired" text-image inputs in the more general and realistic scenario where i) the semantic information carried by the text and the image may occur at different timestamps and ii) the condition image can appear at an arbitrary position rather than as the first frame of the synthesized video. Video generation under this unpaired setting requires reasoning over the intrinsic connections between the given textual description and the referenced image, which is challenging and remains unexplored. To address this challenge, we study the problem of unpaired text-image-to-video generation for the first time and propose ReasonDiff, a novel model for accurate video generation from unpaired text-image inputs. Specifically, ReasonDiff introduces a VisionNarrator module that harnesses the reasoning abilities of a multi-modal large language model to analyze the unpaired text-image conditions, producing coherent per-frame narratives that temporally align them. Building upon VisionNarrator, ReasonDiff further introduces an AlignFormer module, which employs a Multi-stage Temporal Anchor Attention mechanism to predict frame-wise latent representations. These reasoning-enhanced latents are then fused with the condition frame, providing structured guidance throughout the video generation process. Extensive experiments and ablation studies demonstrate that ReasonDiff significantly outperforms state-of-the-art baselines in video generation quality with unpaired text-image inputs. Generated video samples can be found at https://reasondiff.github.io/.

Method

Overview of the proposed ReasonDiff model, which consists of two key components: (1) the MLLM-Driven Multi-frame Reasoner, and (2) the Reasoning-Guided Generative Model. The generative model operates under the guidance of the multi-modal reasoning results generated by the reasoner.
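
The two-stage flow described in this overview can be illustrated with a toy sketch. Everything below is hypothetical: the names `ReasoningResult`, `reason`, and `generate`, the string placeholders used as "frames", and the image note "book on a desk" are assumptions made for illustration only, not the authors' implementation. The sketch only shows the data flow: a reasoner produces one narrative per frame and a position for the condition image, and a generator consumes that guidance while keeping the condition image at an arbitrary (not necessarily first) frame.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningResult:
    """Output of the (hypothetical) reasoner stub."""
    narratives: List[str]  # one narrative per output frame
    anchor_index: int      # frame position assigned to the condition image


def reason(text: str, image_note: str, num_frames: int, anchor_index: int) -> ReasoningResult:
    """Stand-in for the MLLM-driven multi-frame reasoner.

    A real reasoner would query a multi-modal LLM; this stub only tags each
    frame with its temporal relation to the condition image.
    """
    if not 0 <= anchor_index < num_frames:
        raise ValueError("anchor_index must be a valid frame index")
    narratives = []
    for t in range(num_frames):
        if t < anchor_index:
            relation = "before"
        elif t == anchor_index:
            relation = "at"
        else:
            relation = "after"
        narratives.append(f"frame {t} ({relation} condition image '{image_note}'): {text}")
    return ReasoningResult(narratives, anchor_index)


def generate(result: ReasoningResult, condition_frame: str) -> List[str]:
    """Stand-in for the reasoning-guided generative model.

    Each "frame" is a string placeholder; the condition frame is placed at
    the anchor position rather than being forced to be the first frame.
    """
    frames = [f"<generated from: {n}>" for n in result.narratives]
    frames[result.anchor_index] = condition_frame
    return frames


result = reason("A man opens the window.", "book on a desk", num_frames=5, anchor_index=2)
video = generate(result, condition_frame="<condition image>")
```

In this toy run the condition image ends up as the third of five frames, with the surrounding frames driven by per-frame narratives; in the actual model the guidance would be latent representations rather than strings.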

Generated Samples

"A man opens the window." (The wind will blow the pages)
"A dog running on the road." (The dog will run past the bicycle)
"A cat plays in the room." (The table leg is broken by the cat)
"A soccer flying." (The window is broken by the soccer)
"A cat plays in the room." (The vase is broken by the cat)
"A car running on the road." (The car will run past the dog)
"A person opens the window." (The wind will blow the candle out)
"A man whistles." (The dog runs happily to the man)

Qualitative Comparison

[Video comparison: for each condition image and text prompt, results from ReasonDiff, DynamiCrafter, LTX-Video, CogVideoX, and Wan2.1 are shown.]
Text prompts: "A person opens the window."; "A cat plays in the room."; "A soccer flying."; "A dog running on the road."



Quantitative Comparison (Complete tables)

Table 1. Quantitative results on the self-constructed ActivityNet dataset.

Models         Imaging Quality  Motion Smooth  Dynamic Degree  CLIP-Text     CLIP-Image    User Rank
DynamiCrafter  0.492(0.111)     0.979(0.019)   0.484(0.499)    0.202(0.057)  0.508(0.087)  2.871(1.239)
LTX-Video      0.398(0.081)     0.977(0.008)   0.734(0.442)    0.211(0.051)  0.544(0.084)  3.307(1.259)
CogVideoX      0.507(0.086)     0.949(0.023)   0.872(0.089)    0.197(0.039)  0.537(0.078)  4.384(1.041)
Wan2.1         0.512(0.103)     0.980(0.023)   0.810(0.280)    0.224(0.056)  0.518(0.079)  2.692(1.079)
ReasonDiff     0.528(0.106)     0.986(0.048)   0.936(0.244)    0.261(0.061)  0.528(0.082)  1.743(1.044)


Table 2. Quantitative results on the MSR-VTT dataset.

Models         Imaging Quality  Motion Smooth  Dynamic Degree  CLIP-Text     CLIP-Image    User Rank
DynamiCrafter  0.517(0.123)     0.984(0.017)   0.440(0.496)    0.201(0.043)  0.526(0.091)  3.179(1.189)
LTX-Video      0.406(0.087)     0.986(0.010)   0.695(0.460)    0.206(0.037)  0.588(0.078)  4.051(0.971)
CogVideoX      0.552(0.082)     0.970(0.015)   0.688(0.323)    0.177(0.025)  0.572(0.059)  3.256(1.481)
Wan2.1         0.560(0.111)     0.962(0.023)   0.665(0.183)    0.191(0.036)  0.552(0.075)  2.743(1.140)
ReasonDiff     0.571(0.109)     0.984(0.028)   0.673(0.470)    0.214(0.044)  0.572(0.092)  1.769(1.245)
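
Each table entry has the form value(value). A minimal sketch of how such an entry could be produced is given below, assuming the first number is a mean over samples and the parenthesized number is a standard deviation; the page does not state this, so it is an assumption. The model names and rank lists are made-up placeholders, not data from the study, and `summarize` is a hypothetical helper.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize(scores: List[float]) -> str:
    """Format a list of per-sample scores as 'mean(std)' with three decimals.

    Uses the population standard deviation; whether the tables use the
    population or sample variant is unknown (assumption).
    """
    return f"{mean(scores):.3f}({pstdev(scores):.3f})"


# Hypothetical user-study ranks (1 = best) given by four raters to two models.
ranks: Dict[str, List[float]] = {
    "ModelA": [1, 2, 1, 3],
    "ModelB": [2, 1, 3, 2],
}

summary = {name: summarize(values) for name, values in ranks.items()}
for name, entry in summary.items():
    print(f"{name:8s} {entry}")
```

For User Rank a lower mean is better, whereas for the other columns a higher mean is better.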