Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.
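The segmented noise scheduling mentioned above can be illustrated with a minimal sketch: contiguous groups of frames receive progressively higher noise levels, so earlier groups act as cleaner context for later prediction groups. The group count, timestep range, and the rectified-flow-style interpolation below are illustrative assumptions for exposition, not the exact formulation used by WorldWeaver.

```python
import numpy as np

def segmented_noise_schedule(latents, num_groups, t_min=0.2, t_max=1.0, rng=None):
    """Noise video latents with one noise level per temporal group (a sketch).

    latents: array of shape (B, T, C, H, W).
    Earlier groups get lower noise (cleaner context); later groups get
    higher noise (prediction targets). Returns the noised latents and
    the per-frame timesteps t in [t_min, t_max].
    """
    B, T = latents.shape[:2]
    # Map each of the T frames to one of num_groups contiguous groups.
    group_ids = np.arange(T) * num_groups // T          # shape (T,)
    # Monotonically increasing noise level per group.
    group_t = np.linspace(t_min, t_max, num_groups)     # shape (G,)
    t = group_t[group_ids]                              # shape (T,)
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(latents.shape)
    # Rectified-flow-style interpolation: x_t = (1 - t) * x_0 + t * eps.
    t_b = t.reshape(1, T, 1, 1, 1)
    return (1.0 - t_b) * latents + t_b * noise, t
```

Because each group shares a single timestep, a training step only needs one denoising target per group rather than per frame, which is one way the scheme can reduce computational cost.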
Below, we present a qualitative comparison of video generation results across three models: Ours, MAGI, and SkyReels-V2. Each row corresponds to a different prompt, with videos generated by each model displayed side by side for comparison.
Prompt: A woman walks down the street and smiles, she puts on sunglasses and keeps walking, she stops and waves at the camera, then turns back and walks away.
Prompt: An elderly couple walks hand in hand in the park. They chat and smile as they stroll. The man feeds the woman a small treat. The camera zooms in on their happy laughter.
Prompt: A young woman types on her laptop in a coffee shop, she takes a sip and checks her schedule, receives a message and smiles, then closes her computer to leave.
Prompt: A little girl sits by the window on a rainy day, she draws shapes on the foggy glass, her mother brings her hot chocolate, and together they watch the rain.
Prompt: A young man jogs around a peaceful lake at dawn, he stops to catch his breath and stretch, he takes a photo of the sunrise, then continues running with determination.
Below, we showcase long-horizon video generation results for robotic arm tasks, demonstrating complex manipulation sequences driven solely by text prompts. Unlike prior methods, which focus on short-term reconstruction accuracy in simple scenes, our approach generates these videos entirely from text instructions, without relying on action guidance.
Prompt: A robot arm picks up a blue cup from the sink area and places it on a tray, then picks up an orange cup and places it on the tray, then picks up the green one and places it on the tray.
Prompt: The robotic arm moves downward, approaches the drawer, and opens it; then the robotic arm moves up, approaches the green can, picks it up, and puts it in the drawer; finally it approaches the black bowl, grips it, and puts it in the drawer.