GS-World: An Engine-driven Learning Paradigm for Pursuing Embodied Intelligence using World Models of Generative Simulation
Abstract: As a pivotal direction for bringing artificial intelligence into the physical world, embodied intelligence is drawing growing research attention across academia and industry. Yet the scaling law behind the success of many large AI models over the past decade has not been observed for embodied intelligence, partly because of the scarcity of the multi-modal, heterogeneous, and physics-related data required to learn it.
In this perspective paper, we analyze the reasons behind this gap and derive an efficiency law, which is in greater demand in this data-scarce setting.
To meet this law, we first propose world models of generative simulation (GS-World), which are expected to model and predict world dynamics in a perfectly physics-accurate manner through generative learning of physics simulation, covering the 3D assets, the environments, and the physical rules that govern their dynamic interactions. Building on such a GS-World engine, we propose an efficient, engine-driven learning paradigm for pursuing embodied intelligence. It is characterized by an automated pipeline of data generation and streaming, training of vision-language-action (VLA) models, model verification, and model deployment, which we term Engine-driven Sim2Real VLA. Because the learning pipeline is differentiable, task-oriented embodiments can also be optimized backward through it.
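The abstract does not specify the engine interface, the VLA training method, or how embodiments are differentiated. The following is only a toy sketch of the pipeline's shape under loudly stated assumptions: a 1D point-mass simulator stands in for a GS-World engine, a least-squares linear policy stands in for a VLA model, and forward-mode dual numbers stand in for a differentiable engine. All names (`simulate`, `generate_data`, `train`, `verify`, `run_pipeline`, `Dual`, `optimize_gain`) and the "actuator gain" embodiment parameter are hypothetical and are not part of any GS-World release.

```python
import random
from dataclasses import dataclass


# Hypothetical toy stand-in for a generative-simulation engine (NOT the GS-World API):
# a 1D point mass at rest at x0, pushed by a constant force gain * action.
def simulate(x0, action, mass, gain=1.0, dt=0.1, steps=10):
    x, v = x0, 0.0
    for _ in range(steps):
        v = v + (gain * action) / mass * dt  # semi-implicit Euler: velocity first
        x = x + v * dt
    return x


# Stage 1: data generation and streaming (yields samples lazily from the engine).
def generate_data(mass, n, rng):
    for _ in range(n):
        x0, action = rng.uniform(-1, 1), rng.uniform(-2, 2)
        yield x0, simulate(x0, action, mass), action  # (start, reached, action)


# Stage 2: train a linear policy action = w * (goal - x0) by least squares
# (a hypothetical stand-in for VLA training).
def train(samples):
    num = sum((g - x0) * a for x0, g, a in samples)
    den = sum((g - x0) ** 2 for x0, g, _ in samples)
    return num / den


# Stage 3: verify the policy on fresh tasks inside the engine.
def verify(w, mass, rng, n=100, tol=1e-6):
    for _ in range(n):
        x0, goal = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if abs(simulate(x0, w * (goal - x0), mass) - goal) > tol:
            return False
    return True


# Stage 4: deploy (here: hand back the policy only if it passed verification).
def run_pipeline(mass, n=200, seed=0):
    rng = random.Random(seed)
    w = train(list(generate_data(mass, n, rng)))
    if not verify(w, mass, rng):
        raise RuntimeError("policy failed verification; not deployed")
    return lambda x0, goal: w * (goal - x0)


# Backward embodiment optimization: forward-mode dual numbers stand in for a
# differentiable engine, giving d(loss)/d(gain) for the hypothetical actuator gain.
@dataclass
class Dual:
    val: float
    der: float = 0.0

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.der + o.der)

    __radd__ = __add__

    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val - o.val, self.der - o.der)

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)

    __rmul__ = __mul__

    def __truediv__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val / o.val, (self.der * o.val - self.val * o.der) / o.val**2)


def optimize_gain(goal, gain=1.0, mass=1.0, lr=1.0, iters=60):
    for _ in range(iters):
        # Seed der=1.0 so the simulation carries d(position)/d(gain) forward.
        err = simulate(0.0, 1.0, mass, gain=Dual(gain, 1.0)) - goal
        loss = err * err
        gain -= lr * loss.der  # gradient-descent step on the embodiment parameter
    return gain
```

In this toy engine the final displacement is approximately 0.55 × gain × action / mass (since dt² × 55 = 0.55), so `run_pipeline(mass=2.0)` returns a policy that reaches its goals to within floating-point error, and `optimize_gain(goal=1.1)` converges to a gain of about 2.0. A real engine-driven pipeline would replace each stage with learned generative simulation, large-scale VLA training, and reverse-mode differentiation; the sketch only illustrates how the four stages and the backward embodiment step connect.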
We will showcase the paradigm by releasing a prototype GS-World engine and automatically trained Sim2Real VLAs. We call on the community to contribute collectively to this promising umbrella of research fields.