Bridging Image Diffusion Transformers for Streamlined Video Generation
Anonymous authors
Approach
Illustration of our proposed DP-VAE. The video is compressed by encoding key frames and residuals separately, which are then combined to form the 3D latent variable z. The 3D decoder reconstructs the video from z; additionally, z is decoded by the original decoder for regularization.
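The key-frame/residual split described in the caption can be sketched as follows. This is a minimal illustration, not the actual DP-VAE: the encoders are replaced by identity maps, and the variable names (`z_key`, `z_res`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, W = 8, 4, 4                 # frames, height, width (toy sizes)
video = rng.standard_normal((T, H, W))

# Split the video into a key frame and per-frame residuals.
key = video[0]                    # first frame serves as the key frame
residuals = video - key           # residuals relative to the key frame

# Stand-ins for the two encoders (identity maps, for illustration only).
z_key = key[None]                 # "encoded" key frame,  shape (1, H, W)
z_res = residuals[1:]             # "encoded" residuals, shape (T-1, H, W)

# Combine into the 3D latent z along the temporal axis.
z = np.concatenate([z_key, z_res], axis=0)   # shape (T, H, W)

# A 3D decoder would reconstruct the video from z; with identity encoders
# the reconstruction is exact: key frame plus residuals.
recon = np.concatenate([z[:1], z[:1] + z[1:]], axis=0)
assert np.allclose(recon, video)
```

With learned encoders the reconstruction is of course lossy; the sketch only shows how the two streams combine into one 3D latent.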
Illustration of 2D and 3D attention mechanisms (left) and their corresponding generated results (right) using a pre-trained image diffusion transformer.
Without 3D Global Attention
With 3D Global Attention
Generated videos using a pre-trained image diffusion transformer with 2D spatial attention versus 3D global attention. This indicates that converting the original 2D spatial attention into 3D global attention provides an effective initialization for video generation without introducing any additional parameters.
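The parameter-free conversion works because 2D spatial attention and 3D global attention differ only in which tokens attend to each other, not in any weights: flattening the frame axis into the token axis turns per-frame attention into global attention. A minimal numpy sketch (shapes and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the token (second-to-last) axis.
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

# Toy shapes: T frames, N spatial tokens per frame, C channels.
T, N, C = 4, 6, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((T, N, C))

# 2D spatial attention: each frame attends only within itself
# (the frame axis acts as a batch dimension).
out_2d = attention(x, x, x)                       # shape (T, N, C)

# 3D global attention: flatten frames so every token attends
# to all T*N tokens; the attention function itself is unchanged.
x_flat = x.reshape(1, T * N, C)
out_3d = attention(x_flat, x_flat, x_flat).reshape(T, N, C)

print(out_2d.shape, out_3d.shape)
```

Because the projection weights (omitted here) are shared between the two modes, a pre-trained image transformer can be switched to 3D global attention with zero new parameters, which is the effect shown in the figure.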