PhysLang: a Small Diagnostic Framework for Language-Grounded World Modeling
Keywords: World Models, Sequence Models, State Space Models, Mamba
TL;DR: PhysLang is a diagnostic benchmark showing that current neural world models treat language as a weak prior, not a causal constraint, and fail under compositional changes and contradictory rules despite low short-horizon prediction error.
Abstract: World models that understand and respond to natural language instructions are a critical step toward intelligent agents operating in human environments. However, it remains unclear whether current neural world models truly ground language in physical dynamics or merely exploit superficial correlations. We present PhysLang, a small diagnostic benchmark designed for falsification rather than performance maximization, which evaluates language grounding in neural world models within a controlled rigid-body physics simulation.
PhysLang consists of 2,000 short-horizon episodes governed by explicit natural language rules (e.g., “red blocks are heavy”), with evaluation splits targeting compositional recombination and logical contradiction under fixed dynamics. We evaluate five sequence architectures (MLP, GRU, LSTM, Transformer, and Mamba) on predicting future physical states conditioned on language.
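To make the setup concrete, here is a minimal sketch of what a PhysLang-style episode record might look like. All field names and values are hypothetical illustrations, not the benchmark's actual schema: a language rule governing dynamics, the rigid bodies in the scene, and a short-horizon state trajectory.

```python
# Hypothetical episode record (field names assumed, not from the paper):
# a natural-language rule plus a short rigid-body state trajectory.
episode = {
    "rule": "red blocks are heavy",          # explicit rule governing the dynamics
    "objects": ["red_block", "blue_block"],  # rigid bodies in the scene
    # T timesteps x N objects, each state as (x, y, vx, vy)
    "states": [
        [(0.0, 1.0, 0.0, 0.0),  (1.0, 1.0,  0.0, 0.0)],
        [(0.0, 0.9, 0.0, -0.2), (1.0, 0.95, 0.0, -0.1)],
    ],
}

# A world model is trained to predict states[t+1] from the state history,
# conditioned on the rule text; evaluation splits recombine or contradict rules.
assert all(len(frame) == len(episode["objects"]) for frame in episode["states"])
```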
Our experiments reveal: (i) systematic interactions between architectural inductive biases and language-conditioned dynamics, with recurrent models exhibiting lower short-horizon prediction error; (ii) extreme sensitivity to linguistic perturbations across all architectures (50–296% error increase when language is shuffled), indicating reliance on fragile correlations; and (iii) consistent failure under contradictory rules, where models interpolate between conflicting constraints rather than resolve them.
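The language-shuffle sensitivity in (ii) can be read as a simple relative-error metric. A minimal sketch, with entirely hypothetical error values chosen for illustration:

```python
# Illustrative language-shuffle sensitivity metric: percent increase in
# prediction error when the conditioning sentence is replaced by a shuffled
# (mismatched) rule. The error values below are made up for illustration.
def pct_error_increase(err_matched: float, err_shuffled: float) -> float:
    """Percent increase in error under shuffled language conditioning."""
    return 100.0 * (err_shuffled - err_matched) / err_matched

# e.g. a prediction MSE of 0.010 with the correct rule vs 0.025 with a
# shuffled rule corresponds to a 150% error increase.
print(round(pct_error_increase(0.010, 0.025)))
```

A model that truly used language as a causal constraint should show a large increase like this only when the swapped rule actually changes the dynamics, not for arbitrary rephrasings.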
Together, these findings suggest that contemporary neural world models encode language as a weak contextual prior rather than as a causal constraint on physical dynamics. PhysLang exposes concrete failure modes in language-grounded world modeling and provides a controlled testbed for studying grounded reasoning beyond benchmark performance. Upon acceptance, we will release the benchmark and code.
Submission Number: 128