Newton - A Small Benchmark for Interactive Foundation World Models

Published: 06 Mar 2025, Last Modified: 15 Apr 2025
Venue: ICLR 2025 Workshop World Models
License: CC BY 4.0
Keywords: world models, interactive world models, foundation models, physics, benchmark, evaluation
TL;DR: A short and sweet dataset aiming to evaluate interactivity and memory in foundation world models.
Abstract: Foundation world models (FWMs) are an emerging class of generative models that aim to generate realistic, interactive worlds from pre-training on video data. In particular, FWMs promise to provide an online, stable environment for training generalist embodied agents. However, contemporary models suffer from several drawbacks, including poor object permanence, and they struggle to apply physical principles consistently. Unlike for large language models (LLMs) and video models, no benchmarks currently exist that specifically evaluate the performance of foundation world models in interactive settings. We present Newton, a series of datasets and benchmarks for training and evaluating small interactive FWMs, particularly on long-context memory and physics tasks. Newton-OP includes 5,000 examples of occlusion and camera rotation, aiming to evaluate models' ability to recall objects in 3D space over long time periods. Newton-Physics adds 5,000 examples of interactive rigid body physics, evaluating both action following and physical accuracy. We also release evaluation code and report the performance of common baselines.
Submission Number: 62