Through BabyAI Steps: Understanding and Evaluating Grounded Intelligence in LLMs

ICLR 2026 Conference Submission 2066 Authors

04 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Grounded Intelligence, Reasoning, Planning, BabyAI
Abstract: Does spatial prediction translate to spatial planning in LLMs? We investigate this question through a controlled experimental test bed built on a textual adaptation of the procedurally generated BabyAI grid world. Our Predict-Plan-Decompose (PPD) framework evaluates three core aspects of grounded intelligence under full observability: (1) predicting the consequences of actions on the environment state, (2) generating action sequences to achieve objectives, and (3) decomposing high-level instructions into subgoal sequences. We find a notable dissociation: while most models achieve over 80\% accuracy on spatial prediction, their performance drops below 20\% on multi-step planning. This pattern holds across state-of-the-art models that perform similarly on mainstream benchmarks yet diverge sharply in our evaluation. Under Full mission, Partial observability, Interactive (FPI) execution, performance degrades further, reaching success rates of only 10-12\% in the most challenging settings. We provide a standardized evaluation framework whose procedural generation enables assessment across unlimited environment instances, reducing contamination risks while supporting dynamic evaluation through custom BabyAIBots and virtual environment execution.
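To make the prediction task concrete, the sketch below shows one way a single action-consequence instance could be sampled from a procedurally generated BabyAI level and serialized to text. This is a minimal illustration assuming the Gymnasium/Minigrid API, not the authors' released code: the `grid_to_text` serializer, the choice of the `BabyAI-GoToRedBall-v0` level, and the prompt layout are all hypothetical stand-ins for the paper's actual textual adaptation.

```python
# Minimal sketch (assumed format, not the paper's exact pipeline): build one
# (mission, state, action) -> next-state instance from a seeded BabyAI level.
import gymnasium as gym
import minigrid  # importing minigrid registers the BabyAI-* environment IDs


def grid_to_text(env):
    """Hypothetical full-observability serializer: list every cell's color and
    object type, then append the agent's position and heading."""
    grid = env.unwrapped.grid
    lines = []
    for y in range(grid.height):
        row = []
        for x in range(grid.width):
            cell = grid.get(x, y)
            row.append("empty" if cell is None else f"{cell.color} {cell.type}")
        lines.append(" | ".join(row))
    ax, ay = env.unwrapped.agent_pos
    lines.append(f"agent at ({ax},{ay}) facing direction {env.unwrapped.agent_dir}")
    return "\n".join(lines)


env = gym.make("BabyAI-GoToRedBall-v0")   # illustrative level choice
obs, _ = env.reset(seed=7)                # a fresh procedurally generated layout per seed
state_before = grid_to_text(env)
action = env.action_space.sample()        # e.g. turn left/right, forward, pickup, toggle
env.step(action)
state_after = grid_to_text(env)

# (obs["mission"], state_before, action) would form the model's prompt;
# state_after serves as the gold next state for the prediction task.
print(obs["mission"])
print(state_before)
```

Because every seed yields a new layout, the same recipe can generate an effectively unlimited pool of evaluation instances, which is what allows the benchmark to limit contamination risk.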
Primary Area: datasets and benchmarks
Submission Number: 2066