Keywords: world model, latent action model
TL;DR: Scaling latent action models to multi-entity domains by decomposing the latent state into independent factors, each with its own inverse and forward dynamics model
Abstract: Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. The latent actions offer an extra degree of freedom for users to generate videos iteratively. However, existing approaches often rely on monolithic inverse and forward dynamics models that learn a single latent action to control the entire scene, and these models struggle to scale to complex scenes where different entities act simultaneously. In this work, we propose FLAM, a factored dynamics framework that decomposes the latent state into independent factors, each with its own inverse and forward dynamics model. This structure enables more accurate modeling of complex, multi-entity dynamics and improves video generation quality in action-free video settings. Evaluated on the Multigrid, Procgen, nuPlan, Sports, and EGTEA datasets, FLAM consistently outperforms the monolithic dynamics model, demonstrating the superiority of the factorized model.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10625
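
Illustrative sketch (not the authors' implementation): the abstract describes splitting the latent state into independent factors, each with its own inverse dynamics model (inferring a latent action from two consecutive states) and forward dynamics model (predicting the next state from the current state and that action). The toy NumPy code below only shows that structure. The linear maps, the names `FactorDynamics` and `step`, and the sizes `K`, `D`, `A` are all hypothetical assumptions; FLAM's actual architecture, training objective, and factor discovery are not specified in this text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: number of factors, per-factor state dim, latent action dim.
K, D, A = 3, 4, 2


class FactorDynamics:
    """Toy inverse and forward dynamics for ONE factor (assumed linear maps)."""

    def __init__(self, rng):
        self.W_inv = rng.normal(scale=0.1, size=(A, 2 * D))
        self.W_fwd = rng.normal(scale=0.1, size=(D, D + A))

    def inverse(self, z_t, z_next):
        # Infer this factor's latent action from its consecutive states.
        return np.tanh(self.W_inv @ np.concatenate([z_t, z_next]))

    def forward(self, z_t, a_t):
        # Predict this factor's next state as a residual update.
        return z_t + self.W_fwd @ np.concatenate([z_t, a_t])


# One independent pair of models per factor, instead of one monolithic pair.
factors = [FactorDynamics(rng) for _ in range(K)]


def step(z_t, z_next):
    """z_t, z_next have shape (K, D); returns actions (K, A) and predictions (K, D)."""
    actions = np.stack([f.inverse(z_t[k], z_next[k]) for k, f in enumerate(factors)])
    pred = np.stack([f.forward(z_t[k], actions[k]) for k, f in enumerate(factors)])
    return actions, pred


z_t = rng.normal(size=(K, D))
z_next = rng.normal(size=(K, D))
actions, pred = step(z_t, z_next)
```

In this sketch each factor reads only its own slice of the state, so changing one factor leaves the predictions for the others untouched. That is the independence property the factored formulation relies on, and a monolithic model would instead map the full state to a single latent action.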