Synthesizing world models for bilevel planning

TMLR Paper4692 Authors

17 Apr 2025 (modified: 11 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: Modern reinforcement learning (RL) systems have demonstrated remarkable capabilities in complex environments, such as video games. However, they still fall short of achieving human-like sample efficiency and adaptability when learning new domains. Theory-based reinforcement learning (TBRL) is an algorithmic framework specifically designed to address this gap. Modeled on cognitive theories, TBRL leverages structured, causal world models---``theories''---as forward simulators for use in planning, generalization and exploration. Although current TBRL systems provide compelling explanations of how humans learn to play video games, they face several technical limitations: their theory languages are restrictive, and their planning algorithms are not scalable. To address these challenges, we introduce TheoryCoder, an instantiation of TBRL that exploits hierarchical representations of theories and efficient program synthesis methods for more powerful learning and planning. TheoryCoder equips agents with general-purpose abstractions (e.g., ``move to''), which are then grounded in a particular environment by learning a low-level transition model (a Python program synthesized from observations by a large language model). A bilevel planning algorithm can exploit this hierarchical structure to solve large domains. We demonstrate that this approach can be successfully applied to diverse and challenging grid-world games, where approaches based on directly synthesizing a policy perform poorly. Ablation studies demonstrate the benefits of using hierarchical abstractions.
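As a rough illustration of the hierarchical structure described above, the sketch below pairs a high-level sequence of abstract "move to" subgoals with a low-level search over a learned transition model. All names (`transition`, `low_level_search`, `bilevel_plan`) and the toy grid dynamics are hypothetical stand-ins introduced for this example only; in TheoryCoder the low-level model is a Python program synthesized from observations by a large language model, and the abstractions and planner are considerably more general.

```python
from collections import deque

# Hypothetical stand-in for the synthesized low-level transition model
# (in the paper this role is played by an LLM-written Python program);
# here, a toy 5x5 grid with clamped boundaries.
def transition(state, action):
    """Return the next (x, y) position after applying a primitive action."""
    x, y = state
    dx, dy = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}[action]
    nx, ny = x + dx, y + dy
    if 0 <= nx < 5 and 0 <= ny < 5:
        return (nx, ny)
    return state  # bumping into the boundary leaves the state unchanged

def low_level_search(start, goal, max_depth=50):
    """Ground one abstract step ("move to goal") by BFS over the learned model."""
    frontier = deque([(start, [])])
    visited = {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        if len(plan) >= max_depth:
            continue
        for action in ("up", "down", "left", "right"):
            nxt = transition(state, action)
            if nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None  # abstract step not achievable under the current model

def bilevel_plan(start, subgoals):
    """High level: a sequence of abstract 'move to' subgoals.
    Low level: each subgoal is refined into primitive actions via the model."""
    state, full_plan = start, []
    for goal in subgoals:
        refinement = low_level_search(state, goal)
        if refinement is None:
            return None  # a full system would replan or revise the model here
        full_plan.extend(refinement)
        state = goal
    return full_plan

if __name__ == "__main__":
    # Reach (2, 0), then (2, 3), starting from the origin.
    print(bilevel_plan((0, 0), [(2, 0), (2, 3)]))
```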
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We thank all the reviewers for their valuable feedback and comments. The summary of changes is as follows:
- **[Reviewer Yy8F]** Added several relevant baselines, including:
  - Comparison to EMPA in Section 4.3 (Table 3 and Figure 6).
  - WorldCoder (another world-model learning approach) in Section 4.3 (Table 7).
  - Fast Downward-only (a pure planning approach) in Section 4.3 (Table 4).
- **[Reviewer Yy8F]** Added Section 4.9, which discusses training-data contamination for GPT-4o and our approach of testing on an environment that could not have been seen during training; results on a newly designed game are shown in Section 4.9, Figure 15.
- **[Reviewer UsbX]** Added Section 4.6, which compares the behavior of GPT-o1 and TheoryCoder, along with Figure 13.
- **[Reviewer UsbX & Reviewer 3KZ9]** Added Section 4.7, which analyzes the robustness and failure modes of our system on more challenging games (Figure 14), showing where and why the system has substantial room for improvement and suggesting what changes should be made.
- **[Reviewer 3KZ9]** Improved readability of the related works section by adding headers.
- **[Reviewer UsbX]** Added token costs in addition to API costs for all new experiments.

Writing has been further condensed in the first three sections of this revision.
Assigned Action Editor: ~Devendra_Singh_Dhami1
Submission Number: 4692