TL;DR: RIFLEx offers a true free lunch—achieving high-quality $2\times$ extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner.
Abstract: Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch, achieving high-quality $2\times$ extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables $3\times$ extrapolation with minimal fine-tuning that requires no long videos.
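The core idea in the abstract, reducing a single "intrinsic" frequency component of the temporal positional embedding so that its period covers the extrapolated length, can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the selection heuristic (picking the component whose period is closest to the training length), and the default base are assumptions; see the linked repository for the actual method.

```python
import numpy as np

def riflex_frequencies(dim, train_len, extrap_factor=2, base=10000.0):
    """Sketch of RIFLEx-style frequency adjustment (hypothetical helper).

    Given standard RoPE-style temporal frequencies, pick a plausible
    'intrinsic' component and divide its frequency by the extrapolation
    factor, stretching its period to span the longer video.
    """
    # Standard RoPE frequencies: base^(-2i/d) for i = 0, 1, ..., d/2 - 1
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    periods = 2 * np.pi / freqs  # period (in frames) of each sinusoidal component

    # Heuristic (assumed): the intrinsic component is the one whose period
    # is closest to the training video length.
    k = int(np.argmin(np.abs(periods - train_len)))

    freqs = freqs.copy()
    freqs[k] /= extrap_factor  # its period now spans extrap_factor * train_len frames
    return freqs, k
```

Only one component is modified; all other frequencies are left untouched, which matches the paper's claim of a minimal change with no additional modifications.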
Lay Summary: While AI can now create short, high-quality videos, making them significantly longer while keeping motion smooth and non-repetitive over time remains a major hurdle. Existing methods often result in awkward temporal issues like repeating actions or unnatural slowdowns.
We investigated how these AI video models handle time. Our analysis revealed that a specific internal mechanism, which we call an "intrinsic frequency," is primarily responsible for these extrapolation problems. Based on this finding, we developed a simple technique named RIFLEx that adjusts this frequency.
RIFLEx offers a straightforward way to improve long video generation. It allows advanced AI models to double video length smoothly without any extra training, essentially providing a performance boost "for free." Furthermore, with just a small amount of tuning (even without using long videos), RIFLEx can enable videos to be tripled in length while enhancing quality. This work makes generating extended, coherent video content with AI more feasible.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/thu-ml/RIFLEx
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: video diffusion transformers; video diffusion model; length extrapolation
Submission Number: 3486