Keywords: text-based games, large language models, benchmarks, reinforcement learning
TL;DR: Text-based games are a challenging benchmark for large language models
Abstract: Text-based games (TBGs) are puzzle-solving, interactive language tasks that have the potential to become a challenging intelligence benchmark for large language models (LLMs). Like interactive dialogue, TBGs require bidirectional communication in natural language, while at the same time being straightforward to evaluate, since the game score clearly indicates progress. We conduct preliminary experiments on the FLAN-T5, Turing, and OPT language models to test their puzzle-solving abilities on an "easy" TBG called "Detective". Our results suggest that these LLMs underperform compared with both state-of-the-art agents and human players. We discuss potential reasons behind the performance gap, such as the complexity of turning TBGs into prompts, LLMs not learning from past trials, their lack of memory, and LLMs relying on statistical prediction instead of goal orientation.
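To illustrate the prompt-construction difficulty the abstract mentions, here is a minimal sketch of one way a TBG interaction history could be serialized into a single LLM prompt. This is purely illustrative; the names (`Turn`, `build_prompt`) and the prompt layout are assumptions, not the format used in the paper.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    observation: str  # text the game engine printed
    action: str       # command the agent issued in response

def build_prompt(turns, current_obs, score):
    """Flatten the past game/agent exchanges plus the newest observation
    into one prompt that asks the model for its next command."""
    lines = ["You are playing a text adventure. Reply with one command."]
    for t in turns:
        lines.append(f"GAME: {t.observation}")
        lines.append(f"YOU: {t.action}")
    lines.append(f"GAME: {current_obs}")
    lines.append(f"(current score: {score})")
    lines.append("YOU:")  # the model's completion is the next action
    return "\n".join(lines)

history = [Turn("You are standing in the chief's office.", "read note")]
prompt = build_prompt(history, "The note mentions a mansion to the west.", 0)
```

Even in this simple form, the prompt grows with every turn, which hints at the memory and context-length issues the abstract raises.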