StreamDiT enables real-time text-to-video generation at 16 FPS on a single GPU (H100)
(1-minute-long videos)
(5-minute-long video)
We applied our method to the 30B model to test its scalability. (Note: StreamDiT-30B does not run in real time on a single H100.)
Interactive inference pipeline of StreamDiT: StreamDiT is designed for real-time responsiveness and interactivity, and its inference pipeline is structured accordingly. To reduce latency, the DiT denoiser, the TAE (VAE) decoder, and the text encoder run in separate processes. A prompt callback function runs continuously, listening for new user prompts. When a user provides a new prompt, the text encoder converts it into a text embedding, which is sent to the DiT denoiser to replace the existing embedding. Subsequent denoising steps attend to the updated embedding through cross-attention, which changes the direction of the text guidance on the fly. This design lets users steer and modify the video content in real time through prompt inputs.
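The prompt-update flow described above can be illustrated with a small sketch. This is not the StreamDiT implementation: it uses Python threads and queues instead of separate processes so that it stays self-contained, it omits the TAE decoder, and every name in it (`encode_prompt`, `text_encoder_worker`, `denoiser_worker`, `run_session`) is hypothetical. The encoder is a toy placeholder and the "denoising step" is a short sleep; the only thing shown is how a new embedding can be swapped in between steps without stopping the loop.

```python
import queue
import threading
import time

STOP = object()  # sentinel used to shut the workers down


def encode_prompt(prompt):
    """Toy stand-in for a text encoder (assumption, not a real model)."""
    return (len(prompt), sum(ord(c) for c in prompt) % 1000)


def text_encoder_worker(prompt_q, embed_q):
    """Convert each incoming prompt to an embedding and forward it."""
    while True:
        prompt = prompt_q.get()
        if prompt is STOP:
            embed_q.put(STOP)
            return
        embed_q.put((prompt, encode_prompt(prompt)))


def denoiser_worker(embed_q, step_q):
    """Run denoising steps, picking up new embeddings between steps."""
    current = embed_q.get()  # block until the first prompt arrives
    while current is not STOP:
        prompt, embedding = current
        time.sleep(0.001)    # placeholder for one step conditioned on `embedding`
        step_q.put(prompt)   # report which prompt conditioned this step
        try:
            # Drain pending updates without blocking; keep only the newest.
            while True:
                current = embed_q.get_nowait()
                if current is STOP:
                    break
        except queue.Empty:
            pass             # no update: keep the current embedding


def run_session(prompts):
    """Feed prompts one by one; return the order in which they took effect."""
    prompt_q, embed_q, step_q = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=text_encoder_worker,
                         args=(prompt_q, embed_q), daemon=True),
        threading.Thread(target=denoiser_worker,
                         args=(embed_q, step_q), daemon=True),
    ]
    for w in workers:
        w.start()

    steps = []
    for prompt in prompts:
        prompt_q.put(prompt)  # plays the role of the prompt callback
        while True:           # wait until a step has used the new prompt
            used = step_q.get()
            steps.append(used)
            if used == prompt:
                break
    prompt_q.put(STOP)
    for w in workers:
        w.join()
    while not step_q.empty():  # collect steps reported during shutdown
        steps.append(step_q.get())

    # Collapse consecutive repeats to show the sequence of conditioning prompts.
    return [p for i, p in enumerate(steps) if i == 0 or p != steps[i - 1]]


order = run_session(["A cat is walking in a garden.",
                     "A tiger is walking in a garden."])
print(order)
```

In this sketch the denoising loop never blocks on the prompt queue after the first prompt, so generation continues at its own pace and a new prompt takes effect at the next step boundary.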
A little boy riding his bike in a garden in spring. -> A little boy riding his bike in a garden in summer. -> A little boy riding his bike in a garden in fall. -> A little boy riding his bike in a garden in winter.
A cat is walking in a garden. -> A tiger is walking in a garden.
Serene nature with a calm lake and cloudy sky in daylight. -> Quiet lake at night under a glowing moon and fading twilight. -> Fireworks exploding over a lake.
A man is walking in a desert. -> A man is walking in a cyberpunk city.
A horse is running on a grassland. -> A cheetah is running on a grassland. -> A horse is running on a grassland.
A man is walking on a desert. -> A man is walking towards a beach. -> A man is walking on a beach.
StreamDiT-4B achieves real-time performance at 16 FPS on a single GPU while maintaining competitive quality with existing methods. Our model generates 512p video streams with temporal consistency and high visual fidelity.
We implemented the existing methods on our base 4B T2V model to enable apples-to-apples comparisons with StreamDiT.
Reuse and Diffuse
FIFO-Diffusion
Ours (Teacher)
Ours (Distilled)
Prompt: An old man takes a pleasant stroll in Antarctica during a beautiful sunset. The old man wears a bright green dress that reaches down to his ankles, and a wide-brimmed sun hat that shields his face from the sun. The man's skin is weathered and wrinkled, with a kind face and a gentle smile. He walks slowly and deliberately, taking in the breathtaking scenery around him. The Antarctic landscape stretches out behind him, with snow-covered peaks and ice shelves glistening in the fading light. The sky above is a kaleidoscope of colors, with hues of pink, orange, and purple blending together in a beautiful sunset. The man's shadow stretches out across the snow as he walks, with the sun casting a warm glow over the entire scene. The lighting is soft and golden, with the sunset casting long shadows across the icy landscape. The video is shot in a cinematic style.
Reuse and Diffuse
FIFO-Diffusion
Ours (Teacher)
Ours (Distilled)
Prompt: Camera tracking shot. New York City is submerged underwater like Atlantis. The city's skyscrapers and buildings are covered in coral and seaweed, with schools of fish darting in and out of the windows. A large whale swims down the middle of the street, its massive body gliding effortlessly through the water. Sea turtles and sharks of various species swim through the streets, some swimming alongside the whale. The Empire State Building and the Statue of Liberty are visible in the distance, covered in coral and anemones. The streetlights are still on, casting a warm glow over the scene. The water is a deep blue, with a few rays of sunlight filtering down from above. The fish and other sea creatures are swimming and playing in the streets, as if they have always lived there. The video is shot in a cinematic style.