Keywords: text-based games, large language models, benchmarks, reinforcement learning
TL;DR: Text-based games are a challenging benchmark for large language models
Abstract: Text-based games (TBGs) are puzzle-solving, interactive language tasks that have the potential to become a challenging intelligence benchmark for large language models (LLMs). Like interactive dialogue, TBGs require bidirectional communication in natural language, while at the same time being straightforward to evaluate, since the game score clearly indicates progress. We conduct preliminary experiments on the FLAN-T5, Turing, and OPT language models to test their puzzle-solving abilities on an "easy" TBG called "Detective". Our results suggest that these LLMs underperform compared with both state-of-the-art agents and human players. We discuss potential reasons behind the performance gap, such as the complexity of turning TBGs into prompts, LLMs not learning from past trials, their lack of memory, and LLMs relying on statistical prediction instead of goal orientation.
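To illustrate the prompt-construction difficulty the abstract mentions, here is a minimal sketch of one way a TBG interaction history could be serialized into a single LLM prompt. This is purely illustrative; the names (`Turn`, `build_prompt`) and the prompt layout are assumptions, not the format used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    observation: str  # text the game engine printed
    action: str       # command the agent issued in response

def build_prompt(turns, current_obs, score):
    """Flatten the past game/agent exchanges plus the newest observation
    into one prompt that asks the model for its next command."""
    lines = ["You are playing a text adventure. Reply with one command."]
    for t in turns:
        lines.append(f"GAME: {t.observation}")
        lines.append(f"YOU: {t.action}")
    lines.append(f"GAME: {current_obs}")
    lines.append(f"(current score: {score})")
    lines.append("YOU:")  # the model's completion is the next action
    return "\n".join(lines)

history = [Turn("You are standing in the chief's office.", "read note")]
prompt = build_prompt(history, "The note mentions a mansion to the west.", 0)
```

Even in this simple form, the prompt grows with every turn, which hints at the memory and context-length issues the abstract raises.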