VideoDiT:

Bridging Image Diffusion Transformers for Streamlined Video Generation

Anonymous authors

Approach



Illustration of our proposed DP-VAE. The video is compressed by separately encoding key frames and residuals, which are then combined to form the 3D latent variable z. The 3D decoder reconstructs the video from z; additionally, z is decoded through the original 2D decoder for regularization.
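The key-frame-plus-residual encoding above can be sketched in a few lines. This is a minimal NumPy illustration under our own assumptions, not the paper's implementation: `encode_frame` is a hypothetical stand-in (simple average pooling) for the real 2D VAE encoder, and the first frame is assumed to be the key frame.

```python
import numpy as np

def encode_frame(frame):
    # Hypothetical stand-in for the 2D VAE encoder:
    # 2x2 average pooling as a toy form of spatial compression.
    B, C, H, W = frame.shape
    return frame.reshape(B, C, H // 2, 2, W // 2, 2).mean(axis=(3, 5))

def dp_vae_encode(video):
    # video: (B, T, C, H, W); assume frame 0 is the key frame.
    key = video[:, 0]                              # key frame (B, C, H, W)
    res = video[:, 1:] - key[:, None]              # residuals w.r.t. the key frame
    z_key = encode_frame(key)[:, None]             # (B, 1, C, h, w)
    z_res = np.stack([encode_frame(res[:, t])      # encode each residual frame
                      for t in range(res.shape[1])], axis=1)
    # Combine key-frame latent and residual latents into the 3D latent z.
    return np.concatenate([z_key, z_res], axis=1)  # (B, T, C, h, w)

video = np.random.rand(2, 5, 3, 8, 8)
z = dp_vae_encode(video)
```

The 3D latent z would then go to the 3D decoder for reconstruction, while the per-frame slices remain compatible with the original 2D decoder used for regularization.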



Illustration of 2D and 3D attention mechanisms (left) and their corresponding generated results (right) using a pre-trained image diffusion transformer.

Without 3D Global Attention

With 3D Global Attention
Generated videos from a pre-trained image diffusion transformer with 2D spatial attention versus 3D global attention. Converting the original 2D spatial attention into 3D global attention provides an effective initialization for video generation without adding any parameters.
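The parameter-free conversion amounts to a reshape: instead of attending within each frame's tokens, all frames' tokens are flattened into one sequence before the same attention operation. Below is a minimal NumPy sketch under our own assumptions (identity projections, plain scaled dot-product attention), not the actual model code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # Plain scaled dot-product self-attention; identity q/k/v projections,
    # so no parameters beyond those of the pre-trained image model.
    w = softmax(x @ np.swapaxes(x, -2, -1) / np.sqrt(x.shape[-1]))
    return w @ x

B, T, N, C = 2, 4, 16, 8            # batch, frames, tokens per frame, channels
x = np.random.rand(B, T, N, C)

# 2D spatial attention: each frame attends only within its own N tokens.
out2d = attention(x.reshape(B * T, N, C)).reshape(B, T, N, C)

# 3D global attention: flatten time into the token axis, so every token
# attends across all T*N tokens -- same weights, no new parameters.
out3d = attention(x.reshape(B, T * N, C)).reshape(B, T, N, C)
```

Because only the sequence layout changes, the pre-trained attention weights transfer directly, which is why the converted model serves as an effective initialization for video generation.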

Experimental Results


🌟 Video Reconstruction 🌟


Ground Truth

Open-Sora-Plan

Open-Sora

CV-VAE

Ours


🌟 Video Generation 🌟



🌟 Image Generation 🌟



🌟 More Results of Video Generation 🌟