Keywords: Natural Language Grounding, Instruction Following, Procedural Content Generation, Semantic Granularity, Structured Generation
Abstract: Natural-language-controllable procedural content generation depends critically on how linguistic concepts are grounded in structured representations. We show that widely used benchmarks rely on coarse semantic encodings that collapse distinct concepts, obscure grounding failures, and systematically inflate apparent instruction-following performance. Focusing on Super Mario level generation, we introduce MARIOPCG, a higher-fidelity dataset with expanded semantic coverage, and evaluate multiple decoder-only language models under controlled conditions. Increasing representational granularity exposes severe controllability failures in limited-capacity models that remain invisible under coarser benchmarks, while larger models exhibit stable behavior only when the representation supports meaningful grounding. These findings establish dataset semantic granularity as a necessary condition for valid evaluation of grounded language control and suggest that prior conclusions drawn from semantically collapsed benchmarks reflect representational artifacts rather than model capability. We will publicly release the dataset, prompts, and evaluation code to support reproducibility and further research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Generation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 7819