Abstract: The world model has emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to model the dynamics of the world, owing to their generalizability. LLMs also serve as world models for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world model is typically evaluated either as a general world simulator or as a functional module of an agent, i.e., predicting transitions to assist planning. This paper argues that LLM-based world models can make decisions solely, but rigorous evaluations are needed. We first present two key observations showcasing how LLM-based world models can make decisions solely, and then present three key observations demonstrating why the current evaluation framework for LLM-based world models is insufficient. We then present our suggested evaluation framework: policy verification, action proposal, and policy planning, in which the world model alone is used for decision making. Finally, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment to enable diverse evaluations. The key findings include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially on tasks that require domain knowledge, e.g., scientific tasks; ii) the performance of LLM-based world models depends predominantly on their performance at key steps, while the total number of steps required for task completion is not a reliable indicator of task difficulty; and iii) combining world models' functionalities for decision making introduces instability in performance and partially obscures the performance gap between strong and weak models.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=xxJ41g3gyk
Changes Since Last Submission: This paper was previously rejected by TMLR, and we have revised it to make it more readable and better organised. Specifically, the main changes are:
1. The title has been changed to clearly reflect the main contributions of this paper. Instead of "Evaluating World Models with LLM for Decision Making", we use "LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed". The previous title reads more like a benchmark paper, whereas our paper mainly examines the roles of LLM-based world models in decision making and how to evaluate them.
2. We revise the introduction section and the motivating observations. In the new version, two dedicated sections clearly illustrate how LLM-based world models can make decisions solely and why current evaluations are insufficient. We believe the revised sections help reviewers and readers understand the motivations of this paper.
3. We also revise the experiment results. Following the above two sections, we present our suggested evaluation framework, report the experiment results, and conclude with takeaways that help readers understand our points.
4. We have also updated the related work section to reflect new techniques and research published since the last submission.
We hope these thorough revisions adequately address the reviewers' concerns and strengthen the manuscript for publication.
Assigned Action Editor: ~Yue_Wang16
Submission Number: 6795