Keywords: Benchmark, Puzzle, Reasoning and Planning
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide variety of tasks, yet their reasoning and planning capabilities in interactive environments remain underexplored. We introduce PuzzlePlex, a benchmark designed to evaluate reasoning and planning in multi-turn interactive environments, with an emphasis on adversarial play.
PuzzlePlex comprises 24 diverse puzzles, spanning deterministic and stochastic games as well as single-player and adversarial scenarios. An important novelty of our benchmark is its multi-turn adversarial reasoning games: to succeed, each LLM must track the history of its own moves and those of the opponent LLM, and generate strategies that outmaneuver the opponent to secure victory. A minimal sketch of this interaction protocol follows.
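To make the interaction protocol concrete, here is a minimal Python sketch of a multi-turn adversarial loop, illustrated with Nim as a stand-in game. All names here (query_llm, legal_moves, play_match) are hypothetical illustrations rather than PuzzlePlex's actual API, and the random-move query_llm is a placeholder for a real model call.

import random

def legal_moves(heaps):
    # In Nim, a move removes 1..n objects from one non-empty heap.
    return [(i, k) for i, n in enumerate(heaps) for k in range(1, n + 1)]

def query_llm(player, heaps, history):
    # Placeholder for an actual LLM call. A real implementation would
    # prompt the model with the rules, the full move history, and the
    # current state, then parse the move it returns.
    return random.choice(legal_moves(heaps))

def play_match(heaps):
    heaps = list(heaps)
    history = []                  # shared record of both players' moves
    player = 0
    while any(heaps):
        move = query_llm(player, heaps, history)
        if move not in legal_moves(heaps):
            return 1 - player     # an illegal or unparsable move forfeits
        i, k = move
        heaps[i] -= k
        history.append((player, move))
        player = 1 - player
    return 1 - player             # whoever took the last object wins

print("winner:", play_match((3, 4, 5)))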
For comparison, we implement customized game-playing strategies, such as dynamic programming approaches; one such baseline is sketched below.
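As an illustration of such a baseline (not the paper's exact implementation), here is a memoized dynamic-programming strategy for the Nim example above: it classifies positions as winning or losing, then always moves into a position that is losing for the opponent.

from functools import lru_cache

@lru_cache(maxsize=None)
def is_winning(heaps):
    # `heaps` must be a tuple so positions are hashable for memoization.
    # The empty position is losing: the previous player took the last object.
    return any(
        not is_winning(heaps[:i] + (n - k,) + heaps[i + 1:])
        for i, n in enumerate(heaps)
        for k in range(1, n + 1)
    )

def dp_move(heaps):
    heaps = tuple(heaps)
    for i, n in enumerate(heaps):
        for k in range(1, n + 1):
            if not is_winning(heaps[:i] + (n - k,) + heaps[i + 1:]):
                return (i, k)     # leave the opponent a losing position
    # No winning move exists; stall by taking a single object.
    i = next(i for i, n in enumerate(heaps) if n)
    return (i, 1)

print(dp_move((3, 4, 5)))

For Nim specifically, this memoized search reproduces the classical nim-sum rule (a position is losing exactly when the XOR of the heap sizes is zero), but the same memoization pattern extends to other deterministic games.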
Our findings indicate that the reasoning and planning abilities of current LLMs remain weak in puzzle-solving contexts. GPT-4 outperforms the other models, successfully competing against the customized strategies (greedy or dynamic-programming baselines) in 49% of cases; when faced with strict rule sets, however, its reasoning and planning capabilities diminish. Beyond the 14 multi-turn adversarial puzzles, we report results on single-player puzzles and on multi-modal challenges that integrate text and images, where LLMs still lag significantly behind even simple heuristics.
A key feature of our benchmark is its ability to generate game instances of graduated difficulty, allowing it to evolve as LLMs become more sophisticated. This adaptability ensures the continued relevance and utility of PuzzlePlex in assessing progress in reasoning and planning within interactive environments; an illustrative generator sketch is given below.
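As a purely illustrative sketch of graduated instance generation, assuming difficulty scales with state-space size (PuzzlePlex's actual generators are game-specific):

import random

def generate_nim_instance(difficulty, seed=0):
    # Higher difficulty means more and larger heaps, enlarging the
    # state space the model must reason over.
    rng = random.Random(seed)
    n_heaps = 2 + difficulty
    max_heap = 3 * (difficulty + 1)
    return tuple(rng.randint(1, max_heap) for _ in range(n_heaps))

for d in range(4):
    print(d, generate_nim_instance(d, seed=d))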
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7741