Abstract: The performance of embodied agents has been shown to improve with increases in model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre-training) are used to model an agent's behavior (imitation learning) or its environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that "bigger is better", we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g. between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task & architecture -- this has important implications for the optimal sizing of models and data.
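For reference, here is a minimal sketch of the kind of parametric law the abstract alludes to, written in the conventional form used by language-model scaling studies; the symbols A, B, E, alpha, beta and the compute approximation C ≈ 6ND are the standard ones from that literature and are assumptions here, not the paper's exact parameterization:

% Conventional scaling-law form from language modeling; illustrative only.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
N^{*}(C) \propto C^{\frac{\beta}{\alpha + \beta}},
\quad
D^{*}(C) \propto C^{\frac{\alpha}{\alpha + \beta}}

Here N is the number of model parameters, D the number of training tokens, C ≈ 6ND the compute budget, and E the irreducible loss. The abstract's point is that the fitted coefficients, and therefore the compute-optimal split between model and data, shift with tokenizer, task, and architecture.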
Lay Summary: Bigger neural networks, more data, and more computing power have been shown to make “embodied” AI (e.g. robots or game-playing agents) work better. In particular, researchers have shown this when models learn from recorded examples: either how an expert acts (imitation learning) or how its world behaves (world modeling). This paper digs into exactly how scale helps. It shows that the same neat, straight-line “power-law” trends seen in language models—where error drops predictably as models grow—also appear in these embodied-AI tasks.
But the exact slopes of those lines change a lot depending on three practical details:
1) Tokenizer: how raw data are broken into pieces.
2) Task: whether the agent is learning to copy actions or to predict its environment.
3) Model design: the neural-network architecture used.
Knowing this lets engineers trade off model and dataset sizes more wisely instead of blindly assuming "bigger is always better."
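As a concrete illustration of that trade-off, here is a minimal Python sketch of fitting a saturating power law of loss against model size from a handful of pilot runs; the data points, constants, and names below are hypothetical and for illustration only, not results or code from the paper.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical pilot runs: model sizes (parameters) and their validation losses.
model_sizes = np.array([1e7, 3e7, 1e8, 3e8, 1e9, 3e9])
val_losses = np.array([3.10, 2.85, 2.62, 2.44, 2.30, 2.19])

def power_law(n, a, alpha, e):
    # Saturating power law: loss = e + a * n^(-alpha), with e the irreducible loss.
    return e + a * n ** (-alpha)

# Fit the coefficients. The paper's claim is that the exponent (and hence the
# optimal model/data split) shifts with tokenizer, task, and architecture, so
# a fit like this must be redone per setting rather than reused across them.
(a, alpha, e), _ = curve_fit(power_law, model_sizes, val_losses, p0=[10.0, 0.1, 2.0])
print(f"fitted: loss ~= {e:.2f} + {a:.1f} * N^(-{alpha:.3f})")

With such a fit per configuration, one can extrapolate the loss a larger model would reach and size the dataset accordingly, instead of assuming a single universal exponent.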
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: world modeling, imitation learning, scaling laws
Submission Number: 9297