Keywords: Latent Action Model, World Model, Multi Agent, Representation Learning
Abstract: Learning latent actions from action-free video has emerged as a powerful paradigm
for scaling up the learning of controllable world models. These latent actions give
users an extra degree of freedom when generating videos iteratively. However, existing
approaches often rely on monolithic inverse and forward dynamics models to learn
a single latent action that controls the entire scene, and they struggle to scale to
complex scenes where different entities act simultaneously. In this work, we propose FLAM, a factored
dynamics framework that decomposes the latent state into independent factors,
each with its own inverse and forward dynamics model. This structure enables
more accurate modeling of complex, multi-entity dynamics and improves video
generation quality in action-free video settings. Evaluated on the Multigrid,
Procgen, nuPlan, and Sports datasets, FLAM consistently outperforms the monolithic
dynamics model, demonstrating the benefit of the factored formulation.
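The factored structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `K`, `D`, `A`, `W_inv`, `W_fwd`, and `factored_step` are hypothetical, the random linear maps stand in for trained networks, and the sketch assumes the latent state has already been split into K factors of equal size. It only shows the shape of the idea, namely that each factor infers its own latent action and predicts its own next state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of factors, per-factor state dim, latent-action dim.
K, D, A = 3, 4, 2

# One (inverse, forward) pair of linear maps per factor.
# Random weights are placeholders for trained networks (an assumption).
W_inv = rng.normal(size=(K, A, 2 * D)) * 0.1
W_fwd = rng.normal(size=(K, D, D + A)) * 0.1


def inverse_dynamics(k, z_t, z_next):
    """Infer factor k's latent action from its own consecutive states."""
    return np.tanh(W_inv[k] @ np.concatenate([z_t, z_next]))


def forward_dynamics(k, z_t, a):
    """Predict factor k's next state from its current state and latent action."""
    return z_t + W_fwd[k] @ np.concatenate([z_t, a])


def factored_step(z_t, z_next):
    """z_t, z_next have shape (K, D); returns per-factor actions and predictions."""
    actions = np.stack(
        [inverse_dynamics(k, z_t[k], z_next[k]) for k in range(K)]
    )
    preds = np.stack(
        [forward_dynamics(k, z_t[k], actions[k]) for k in range(K)]
    )
    return actions, preds


z_t = rng.normal(size=(K, D))
z_next = rng.normal(size=(K, D))
actions, preds = factored_step(z_t, z_next)

# Reconstruction error that a training loop would minimize.
loss = float(np.mean((preds - z_next) ** 2))
```

A monolithic model would instead use one inverse and one forward map over the concatenated state; in this sketch, by contrast, changing one factor's states leaves the other factors' actions and predictions untouched.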
Submission Number: 74