Can Language Models Be Used in Multistep Commonsense Planning Domains?

Published: 01 Jan 2023 · Last Modified: 12 May 2025 · AGI 2023 · CC BY-SA 4.0
Abstract: Transformer-based language models have recently been the focus of much attention, due to their impressive performance on myriad natural language processing (NLP) tasks. One criticism when evaluating such models on problems such as commonsense reasoning is that the benchmarking datasets may not be challenging or general enough. In response, task environments involving some kind of multistep planning have emerged as a more stringent, and useful, evaluation paradigm. ScienceWorld is one such environment, with a weaker dependence on language itself (compared to core commonsense reasoning). In the original publication, ScienceWorld problems proved difficult to solve even for a reasonably advanced language model. This paper demonstrates that, while this holds for the hardest version of the problem, even first-generation models like BERT can achieve good performance on many interesting intermediate problems within ScienceWorld. Our results, in addition to proposing a more practical methodology and metrics for evaluating language models on multistep planning domains involving commonsense reasoning, also suggest that language models are still likely to be an essential component of (rather than completely orthogonal to) a more comprehensive approach.