Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds

Paper ID #1249

Supplementary Material


Comparisons

We provide video comparisons between our method and the other methods evaluated in the paper.
For LTX-Video, we follow the diffusers example to generate samples using the same prompts.
Methods compared: LTX-Video [1], CogVideoX-2B [2], Wan2.1-1.3B [3], and Ours.

Prompt: 3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest. The creature, a whimsical blend of a rabbit and a squirrel, has soft blue fur and a bush.
Prompt: A cat sitting at a grand piano, elegantly playing a classical piece with its paws.
Prompt: A corgi vlogging itself in tropical Maui.
Prompt: A movie trailer featuring the adventures of the 30-year-old spaceman wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors.
Prompt: A pair of lovebirds preening each other's feathers.
Prompt: A skeleton wearing a flower hat and sunglasses dances in the wild at sunset.

 

 

 


More Results

This section provides additional results from our model.
The videos below show additional results from our mobile model.

 

 

 


Mobile Demo on iPhone 16 Pro Max

 

 

 

References

[1] HaCohen, Yoav, et al. "LTX-Video: Realtime Video Latent Diffusion." https://arxiv.org/abs/2501.00103 (2024).

[2] Yang, Zhuoyi, et al. "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer." ICLR (2025).

[3] Team Wan, et al. "Wan: Open and Advanced Large-Scale Video Generative Models." https://arxiv.org/abs/2503.20314 (2025).