Self Forcing: Bridging Training and Inference in Autoregressive Video Diffusion
(Supplementary Material)


Qualitative Comparison with Relevant Baselines

We qualitatively compare our method against relevant baselines. Our method matches the speed of CausVid, with both methods being approximately 150–400× faster than the other methods in terms of latency. Our video quality is substantially better than CausVid's: it is free from over-saturation artifacts and exhibits more dynamic motion. Below, we present 5-second video samples generated from prompts in the MovieGenBench dataset.


Wan2.1-1.3B

SkyReels2-1.3B

MAGI-1-4.5B

CausVid-1.3B

Ours-1.3B

A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

A petri dish with a bamboo forest growing within it that has tiny red pandas running around.

An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands, 3D digital render art style.



Code/Uncurated Samples

In the local folder uncurated_samples, we attach uncurated samples from our method generated from the first 64 prompts in the MovieGenBench dataset with a fixed random seed (seed = 0). We provide source code for our method in the code folder.



Internal Comparison: Distribution Matching Loss

We observe that chunk-wise Self Forcing with the DMD, SiD, and GAN objectives yields qualitatively similar results, while frame-wise Self Forcing with the GAN loss performs slightly worse than with DMD or SiD, possibly due to the inherent challenges of training GANs. We use chunk-wise Self Forcing with the DMD objective for other comparisons unless otherwise specified.


Self Forcing (chunk-wise, SiD)

Self Forcing (chunk-wise, GAN)

Self Forcing (chunk-wise, DMD)

Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.


Self Forcing (frame-wise, SiD)

Self Forcing (frame-wise, GAN)

Self Forcing (frame-wise, DMD)

A close up view of a glass sphere that has a zen garden within it. There is a small dwarf in the sphere who is raking the zen garden and creating patterns in the sand.



Rolling KV Cache Extrapolation: Comparing With and Without Local Attention Training

We show that naively using a rolling KV cache to generate long videos leads to flickering artifacts, which can be addressed by training with local attention.
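The rolling-cache mechanism can be illustrated with a minimal sketch (hypothetical class and names, not the paper's actual implementation): each newly generated frame appends its key/value states to the cache, and once the cache exceeds a fixed window, the oldest frame's entries are evicted, so every new frame only attends over a bounded local context.

```python
# Minimal sketch of a rolling KV cache for autoregressive video generation.
# All names here (RollingKVCache, max_frames, etc.) are illustrative
# assumptions, not the paper's actual code.
from collections import deque


class RollingKVCache:
    def __init__(self, max_frames: int):
        self.max_frames = max_frames
        self.keys = deque()    # one entry per generated frame
        self.values = deque()

    def append(self, k, v):
        # Cache the new frame's key/value states.
        self.keys.append(k)
        self.values.append(v)
        # Evict the oldest frame once the window is full, keeping
        # attention local to the most recent `max_frames` frames.
        if len(self.keys) > self.max_frames:
            self.keys.popleft()
            self.values.popleft()

    def window(self):
        # The context the next frame attends over.
        return list(self.keys), list(self.values)


cache = RollingKVCache(max_frames=3)
for t in range(5):
    cache.append(f"k{t}", f"v{t}")
ks, vs = cache.window()
# With a window of 3 after 5 frames, only frames 2-4 remain cached.
```

At inference time, frames generated early in the video fall out of the window, which is why training the model with a matching local attention pattern is needed to avoid the train/test mismatch that causes flickering.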


Without Local Attention Training

With Local Attention Training

Without Local Attention Training

With Local Attention Training



Limitation: Extrapolation Quality

Although Self Forcing effectively addresses exposure bias and reduces error accumulation within the training video length, extrapolating beyond the length the model is trained on remains challenging and often leads to quality degradation. We show 30-second video samples generated by our method below. Quality degradation is clearly visible in the latter half of each video.


This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest...

A cartoon kangaroo disco dances.

A Chinese Lunar New Year celebration video with Chinese Dragon.

3D animation of a small, round, fluffy creature with big, expressive eyes explores a vibrant, enchanted forest...