TL;DR: Towards Video Generation Models as World Simulators
Abstract: Recent advancements in predictive models have demonstrated exceptional capabilities in predicting the future state of objects and scenes. However, the lack of categorization based on inherent characteristics continues to hinder the progress of predictive model development. Additionally, existing benchmarks are unable to effectively evaluate higher-capability, highly embodied predictive models from an embodied perspective. In this work, we classify the functionalities of predictive models into a hierarchy and take the first step in evaluating World Simulators by proposing a dual evaluation framework called WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and Implicit Manipulative Evaluation, encompassing human preference assessments from the visual perspective and action-level evaluations in embodied tasks, covering three representative embodied scenarios: Open-Ended Embodied Environment, Autonomous Driving, and Robot Manipulation. In the Explicit Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment dataset based on fine-grained human feedback, which we use to train a Human Preference Evaluator that aligns with human perception and explicitly assesses the visual fidelity of World Simulators. In the Implicit Manipulative Evaluation, we assess the video-action consistency of World Simulators by evaluating whether the generated situation-aware video can be accurately translated into the correct control signals in dynamic environments. Our comprehensive evaluation offers key insights that can drive further innovation in video generation models, positioning World Simulators as a pivotal advancement toward embodied artificial intelligence.
Lay Summary: Predictive models are becoming increasingly powerful at forecasting how objects and environments evolve over time. Yet, it's still unclear how to systematically measure their capabilities—especially when these models are used in physically grounded settings like robotics or autonomous driving. Traditional benchmarks often fail to capture the full spectrum of skills needed for real-world, embodied prediction.
To address this, we introduce WorldSimBench, a benchmark designed to evaluate “World Simulators”: models that generate future world states both visually and physically. We categorize predictive model functionalities into a structured hierarchy and propose a two-part evaluation framework: Explicit Perceptual Evaluation, which uses human feedback to measure how realistic the generated videos look, and Implicit Manipulative Evaluation, which tests how well these videos can drive real-world actions in tasks like robot control or navigation (see the sketch after this summary).
By combining human preference alignment and task-grounded performance, WorldSimBench provides a holistic view of predictive model quality. It sets a foundation for building more general, reliable, and physically grounded AI systems capable of seeing, predicting, and acting in the real world.
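To make the two-part framework concrete, here is a minimal, purely illustrative sketch of how the dual evaluation could be composed. Every name in it (the evaluator, the video-to-action translator, the episode runner) is a hypothetical placeholder under our reading of the abstract, not WorldSimBench's actual interface.

```python
# Illustrative sketch of the dual evaluation described above; all names are
# hypothetical placeholders, not the actual WorldSimBench API.
from dataclasses import dataclass
from typing import Callable, List, Sequence

Video = object    # placeholder type for a generated video clip
Action = object   # placeholder type for a low-level control signal


@dataclass
class BenchmarkResult:
    perceptual_score: float    # Explicit: alignment with human preference
    manipulative_score: float  # Implicit: task success rate in the environment


def evaluate_world_simulator(
    generate_video: Callable[[str], Video],                 # model under test
    score_video: Callable[[Video], float],                  # human-preference evaluator
    video_to_actions: Callable[[Video], Sequence[Action]],  # video-to-action translator
    run_episode: Callable[[Sequence[Action]], bool],        # executes actions, reports success
    instructions: List[str],                                # embodied task instructions
) -> BenchmarkResult:
    """Combine the explicit (perceptual) and implicit (manipulative) scores."""
    perceptual_scores: List[float] = []
    successes: List[bool] = []
    for instruction in instructions:
        video = generate_video(instruction)
        # Explicit Perceptual Evaluation: score visual fidelity with a
        # model trained to align with human preferences.
        perceptual_scores.append(score_video(video))
        # Implicit Manipulative Evaluation: translate the generated video
        # into control signals and check whether they solve the task.
        actions = video_to_actions(video)
        successes.append(run_episode(actions))
    return BenchmarkResult(
        perceptual_score=sum(perceptual_scores) / len(perceptual_scores),
        manipulative_score=sum(successes) / len(successes),
    )
```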
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Applications->Computer Vision
Keywords: Embodied Vision, World Model, Dataset and Benchmark
Submission Number: 6145