This page contains supplementary videos for the paper "Progressive Autoregressive Video Diffusion Models".
Please click the arrows to navigate through the videos. Due to the 50MB limit of the supplementary material, we are only able to provide 3 videos per model out of the total 40 videos in our long video generation benchmark, and we downsize all videos to a width of 320 pixels; we keep the original frame rate of 24 FPS and the original aspect ratios.
Comparison with baselines
3 videos generated by our models and baselines.
Please refer to the main paper (Sec. 4.1, 4.2) for the meaning of each method/model name.
RW-M only has partial results.
All videos are resized to a width of 320 pixels, with heights scaled to preserve their original aspect ratios. Open-Sora v1.2 (O), StreamingSVD, SVD-XT, and FIFO-OSP videos originally have resolutions of 424x240, 1280x720, 1024x576, and 256x256 respectively.
While downsizing the videos is not fair to the baselines with higher original resolutions, many of the qualitative comparison aspects, e.g. temporal consistency, motion smoothness, per-frame visual quality, and artifacts, are not affected by the downsizing.
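As a sketch of the resizing described above, the downsized height of each baseline video can be computed from the original resolutions listed in this section (the even-rounding step is our assumption, since most video codecs require even frame dimensions):

```python
# Resolutions taken from the text above: (width, height) in pixels.
originals = {
    "Open-Sora v1.2 (O)": (424, 240),
    "StreamingSVD": (1280, 720),
    "SVD-XT": (1024, 576),
    "FIFO-OSP": (256, 256),
}

TARGET_WIDTH = 320

def downsized_height(width: int, height: int, target_width: int = TARGET_WIDTH) -> int:
    """Scale the height so the aspect ratio is preserved at the target width,
    rounded to the nearest even integer (a common codec requirement)."""
    scaled = height * target_width / width
    return round(scaled / 2) * 2

for name, (w, h) in originals.items():
    print(f"{name}: {w}x{h} -> {TARGET_WIDTH}x{downsized_height(w, h)}")
```

For example, StreamingSVD's 1280x720 videos become 320x180, keeping the 16:9 aspect ratio.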
The results below demonstrate that, on our 60-second long video generation benchmark, our method generates long videos with substantially higher visual quality and temporal consistency than the baseline methods.
PA-M (ours)
RW-M
PA-O-b (ours)
RN-O-b
StreamingSVD
SVD-XT
FIFO-OSP
Comparison with Sora
9 videos generated by our models and Sora.
The Sora videos are sampled from Sora's website (https://openai.com/index/sora/). They are also downsized to a width of 320 pixels, for the same reason discussed above. The overall quality of our generated videos is comparable to Sora's, while ours are much longer: the Sora videos are mostly 20 seconds long, with one at 8 seconds and one at 60 seconds, whereas all of our videos are 60 seconds long.
This demonstrates that our model is advancing the length of video generation at the frontier level.
Ablation Study
Ablation studies on Chunked Latents and Overlapped Conditioning
The results below are obtained by training and running inference with the PA-M model for a similar number of training steps (fewer than the full training schedule), and by running training-free inference with the PA-O-base model, under three conditions: with both Chunked Latents and Overlapped Conditioning, with only Chunked Latents, and without both techniques.
Comparing the first and second, we can see that the model with only Chunked Latents generates videos that jitter temporally. Comparing the second and third, we can see that the model without both techniques generates videos that quickly diverge after 2 seconds.
These results show that both techniques are crucial for the model to generate high-quality videos.
PA-M full
(with Chunked Latents and Overlapped Conditioning)
PA-M with Chunked Latents
PA-M without both techniques
PA-O full
(with Chunked Latents and Overlapped Conditioning)
PA-O with Chunked Latents
Ablation study on Variable Length
Here we compare Variable Length inference results of PA-M models trained with and without Variable Length.
Without Variable Length training, the second video shows temporal jittering and abrupt scene changes at the 1st and 59th seconds. This is because the model is not trained to generate the first/last chunk of latent frames consistently with the prior chunks.
With Variable Length training, the first video avoids the jittering and abrupt scene changes at the 1st and 59th seconds and remains temporally smooth throughout.
Furthermore, Variable Length inference enables the model to generate precisely 1440 frames, whereas without this technique the model would need to discard the noisy chunks remaining in the context window, corresponding to frames 1441-1584, when it reaches the 1440th frame.
Being able to stop the autoregressive video denoising at a precise ending frame allows our model to generate a proper ending to the video, e.g. the woman exits the camera view in the first video, which is not possible without the Variable Length technique.
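The frame counts above follow directly from the 24 FPS frame rate; as a quick sanity check (the numbers are taken from the text above, not computed from the model itself):

```python
FPS = 24
DURATION_S = 60

# A 60-second video at 24 FPS is exactly 1440 frames, the precise
# ending point that Variable Length inference can stop at.
frames = FPS * DURATION_S
print(frames)  # 1440

# Without Variable Length, the noisy frames 1441-1584 still in the
# context window would be discarded at the 1440th frame.
discarded = 1584 - 1440
print(discarded, "frames =", discarded / FPS, "seconds of discarded latents")
```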
PA-M full
(with Variable Length training and inference)
PA-M without Variable Length training
but with Variable Length inference