Included are video samples generated by our video model. The samples have 16 frames and 256x256 spatial resolution.

Training follows a three-stage progressive pipeline: it starts from a 64x64 image model, moves to a 16x64x64 video model, and finally to a 16x256x256 video model.
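As a rough illustration, the stage schedule can be written down as a small table of output shapes. This is only a sketch: the `Stage` class, the stage names, and the idea that each stage is initialized from the previous one are assumptions made for illustration, not details of the actual training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the progressive schedule (hypothetical helper)."""
    name: str     # illustrative label, not an official name
    frames: int   # 1 for the image model
    height: int
    width: int

    @property
    def shape(self):
        # (frames, height, width) of the samples the stage produces
        return (self.frames, self.height, self.width)


# Shapes taken from the description above; names are made up.
STAGES = [
    Stage("image_64", 1, 64, 64),
    Stage("video_16x64", 16, 64, 64),
    Stage("video_16x256", 16, 256, 256),
]


def describe(stages):
    """Return one summary line per stage, in training order."""
    lines = []
    for prev, cur in zip([None] + stages[:-1], stages):
        # Assumption: every stage after the first continues from the previous one.
        origin = "first stage" if prev is None else f"continues from {prev.name}"
        lines.append(f"{cur.name}: {cur.frames}x{cur.height}x{cur.width} ({origin})")
    return lines


if __name__ == "__main__":
    for line in describe(STAGES):
        print(line)
```

The final entry matches the samples included here: 16 frames at 256x256.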

Each file consists of samples generated from the same prompt, with 1-4 videos per prompt.