FLAM: Scaling Latent Action World Models with Factorization

Published: 19 Sept 2025, Last Modified: 27 Oct 2025, Venue: NeurIPS 2025 Workshop EWM, License: CC BY 4.0
Keywords: Latent Action Model, World Model, Multi Agent, Representation Learning
Abstract: Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up the learning of controllable world models. Latent actions give users an extra degree of freedom for generating videos iteratively. However, existing approaches often rely on monolithic inverse and forward dynamics models that learn a single latent action controlling the entire scene, and they struggle to scale to complex scenes where different entities act simultaneously. In this work, we propose FLAM, a factored dynamics framework that decomposes the latent state into independent factors, each with its own inverse and forward dynamics model. This structure enables more accurate modeling of complex, multi-entity dynamics and improves video generation quality in action-free video settings. Evaluated on the Multigrid, Procgen, nuPlan, and Sports datasets, FLAM consistently outperforms the monolithic dynamics model, demonstrating the advantage of the factorized model.
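To make the factorization idea concrete, the following is a minimal toy sketch, not the FLAM implementation. It assumes the latent state is already split into K factor vectors and uses random linear maps in place of learned networks. The class name `FactoredLatentActionSketch`, its methods, and all dimensions are hypothetical and chosen only for illustration. Each factor has its own inverse dynamics map (inferring a latent action from consecutive factor states) and its own forward dynamics map (predicting the next factor state from the current state and that action).

```python
import numpy as np


class FactoredLatentActionSketch:
    """Toy factored latent action model (illustrative assumption, not FLAM itself)."""

    def __init__(self, num_factors, factor_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One inverse dynamics map per factor: [z_t^k, z_{t+1}^k] -> a^k.
        self.w_inv = rng.normal(
            scale=0.1, size=(num_factors, 2 * factor_dim, action_dim)
        )
        # One forward dynamics map per factor: [z_t^k, a^k] -> change in z^k.
        self.w_fwd = rng.normal(
            scale=0.1, size=(num_factors, factor_dim + action_dim, factor_dim)
        )

    def infer_actions(self, z_t, z_next):
        """Infer one latent action per factor from states of shape (K, D)."""
        pair = np.concatenate([z_t, z_next], axis=-1)  # (K, 2D)
        return np.tanh(np.einsum("kd,kda->ka", pair, self.w_inv))  # (K, A)

    def predict_next(self, z_t, actions):
        """Predict each factor's next state from its own state and action."""
        inp = np.concatenate([z_t, actions], axis=-1)  # (K, D + A)
        return z_t + np.einsum("kd,kdf->kf", inp, self.w_fwd)  # (K, D)


# Usage: three factors, each with a 4-dim state and a 2-dim latent action.
model = FactoredLatentActionSketch(num_factors=3, factor_dim=4, action_dim=2)
rng = np.random.default_rng(1)
z_t = rng.normal(size=(3, 4))
z_next = rng.normal(size=(3, 4))
actions = model.infer_actions(z_t, z_next)
z_pred = model.predict_next(z_t, actions)
```

Because every factor has separate weights, changing the latent action of one factor in this sketch leaves the predictions for the other factors unchanged, which is the property that distinguishes a factored model from a monolithic one with a single scene-wide latent action.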
Submission Number: 74