VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 Oral · CC BY 4.0
Abstract: Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce **VideoJAM**, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn *a joint appearance-motion representation*. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce **Inner-Guidance**, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
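
To make the training-time change concrete, below is a minimal PyTorch-style sketch (not the authors' released code) of a diffusion training step with a joint appearance-motion objective. The two-output `model` interface, the simple linear noising schedule, and the weight `lambda_motion` are illustrative assumptions; the base model's actual scheduler and motion representation (e.g., optical flow encoded into the model's latent space) will differ.

```python
import torch
import torch.nn.functional as F

def joint_appearance_motion_loss(model, video_latents, motion_latents, t, lambda_motion=1.0):
    """Sketch of a training step where one shared representation predicts
    both the appearance target and the corresponding motion target.

    Assumes `model(noisy, t)` returns (pred_appearance, pred_motion) and
    that `t` holds per-sample noise levels in [0, 1].
    """
    noise = torch.randn_like(video_latents)
    # Simple linear interpolation noising; a stand-in for whatever
    # diffusion/flow-matching schedule the base model uses.
    alpha = t.view(-1, 1, 1, 1, 1)  # broadcast over (C, T, H, W)
    noisy = (1.0 - alpha) * video_latents + alpha * noise

    pred_appearance, pred_motion = model(noisy, t)

    # Standard pixel/latent reconstruction loss plus a motion reconstruction loss.
    loss_appearance = F.mse_loss(pred_appearance, video_latents)
    loss_motion = F.mse_loss(pred_motion, motion_latents)
    return loss_appearance + lambda_motion * loss_motion
```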
Lay Summary: Text-based video generation is an intensely studied problem in modern computer vision. However, even large-scale proprietary models trained on millions—or even billions—of high-quality videos still struggle to faithfully model the temporal axis. This often results in severe temporal incoherence: limbs appearing or disappearing arbitrarily, objects defying basic physical laws, or extreme distortions. In this work, we investigate this prominent issue and find that it can be attributed to the pixel reconstruction training objective commonly used to train these models. Intuitively, pixel reconstruction favors appearance-based features—such as colors, shapes, and outlines—over temporal features like motion. As a result, models tend to display limited sensitivity to temporal information. To address this issue, we propose a framework dubbed **VideoJAM**. VideoJAM requires the model to explicitly learn temporal information by incorporating a motion-based objective. During training, we modify the loss to predict not only pixels but also their corresponding motion. This compels the model to represent temporal information, which is necessary to reconstruct the motion signal. At inference time, we introduce **Inner-Guidance**, a mechanism wherein the motion signal predicted by the model is used to guide generation toward temporally coherent results. This allows the model to steer itself using predictions from previous generation steps. We benchmark our models against their "base" counterparts, which do not employ our framework, as well as against a range of state-of-the-art proprietary models such as OpenAI’s Sora and Kling. In all cases, our framework significantly improves motion coherence without compromising other aspects of generation, such as aesthetic quality and prompt alignment.
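
The inference-time mechanism can be sketched in the same spirit. The snippet below illustrates, under assumed interfaces, a guided denoising step in which the model's own motion prediction at the current step is fed back as an extra conditioning signal and combined in a classifier-free-guidance-style formula; the exact combination used by Inner-Guidance and the weights `w_text` and `w_motion` here are placeholders, not the paper's values.

```python
import torch

@torch.no_grad()
def inner_guidance_style_step(model, noisy, t, text_cond, w_text=7.5, w_motion=2.0):
    """Sketch of one guided denoising step that reuses the model's own
    motion prediction as a dynamic conditioning signal.

    Assumes `model(noisy, t, text=..., motion=...)` returns
    (noise_prediction, motion_prediction).
    """
    # First pass: obtain the model's current motion prediction.
    _, motion_pred = model(noisy, t, text=text_cond, motion=None)

    # Additional passes: condition on text plus the model's own motion
    # prediction, on text alone, and on nothing (unconditional).
    eps_full, _ = model(noisy, t, text=text_cond, motion=motion_pred)
    eps_text, _ = model(noisy, t, text=text_cond, motion=None)
    eps_uncond, _ = model(noisy, t, text=None, motion=None)

    # Classifier-free-guidance-style combination with an extra motion term.
    return eps_uncond + w_text * (eps_text - eps_uncond) + w_motion * (eps_full - eps_text)
```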
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Video generation, Motion understanding, Diffusion models
Submission Number: 1047