Keywords: Benchmark, World Models, Procedural Planning
TL;DR: A benchmark for evaluating the high-level world modeling and long-horizon procedural planning capabilities of models on human-centric activities
Abstract: World models predict future world states resulting from actions, enabling AI agents to perform planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating the world modeling and procedural planning capabilities of different models. In contrast to prior works that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark to emphasize actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. To prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" – identical actions observed in different contexts – as candidates for selection. The benchmark is grounded in a formal framework of partially observable semi-MDPs, which ensures better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on high-level world modeling and 38% on long-horizon procedural planning, whereas humans are able to solve both tasks perfectly.
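The discriminative protocol described in the abstract – pick the correct action given initial and final states – can be sketched as a simple argmax-over-candidates accuracy metric. This is an illustrative sketch only: the function names (`evaluate_wm`, `score_fn`) and the toy string-based scorer are hypothetical and not the benchmark's actual API or data format.

```python
# Hypothetical sketch of a WorldPrediction-WM-style evaluation: for each item,
# a model scores every candidate action and is credited when the ground-truth
# candidate receives the highest score. All names here are illustrative.

def evaluate_wm(examples, score_fn):
    """examples: list of (initial_state, final_state, candidates, gold_index).
    score_fn(initial, final, candidate) -> higher means more plausible."""
    correct = 0
    for initial, final, candidates, gold in examples:
        scores = [score_fn(initial, final, c) for c in candidates]
        if scores.index(max(scores)) == gold:
            correct += 1
    return correct / len(examples)

# Toy scorer: prefers the candidate whose label matches the state transition.
def toy_score(initial, final, candidate):
    return 1.0 if candidate == f"{initial}->{final}" else 0.0

examples = [
    ("raw", "chopped", ["raw->chopped", "raw->boiled"], 0),
    ("chopped", "cooked", ["chopped->frozen", "chopped->cooked"], 1),
]
print(evaluate_wm(examples, toy_score))  # 1.0
```

A real evaluation would replace `toy_score` with a model's likelihood or preference over video candidates; the accuracy computation itself stays the same.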
Submission Number: 11