Keywords: Mathematical & reasoning benchmark, Search & Combinatorial problems, A* algorithm
TL;DR: We introduce a novel benchmark of unique combinatorial problems that advanced LLMs fail to solve, and present a prompting and inference method using the A* algorithm that significantly improves LLMs' performance.
Abstract: Recently, Large Language Models (LLMs) have attained impressive performance on math and reasoning benchmarks. However, they still often struggle with multi-step reasoning that is relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique combinatorial problem types that avoid training contamination (each equipped with an automated pipeline to generate an arbitrary number of instances) and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text; e.g., GPT-4 and o1-preview solve only 1.4% and 18.6% correctly, respectively. SearchBench problems require considering multiple pathways to the solution and backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps only slightly. We next introduce an in-context learning approach that prompts the model to implement A*, an informed search algorithm, to comprehensively traverse the problem state space, improving the models' performance. We further extend this approach and propose the Multi-Stage-Multi-Try inference method, which breaks down the A* algorithm implementation into two stages and auto-verifies the first stage against unit tests, raising GPT-4's performance above 57%.
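For context, the A* search the abstract refers to can be sketched generically as follows. This is a minimal illustrative implementation, not the paper's prompted code; the function names (`a_star`, `neighbors`, `heuristic`) are placeholders chosen here for clarity.

```python
import heapq
import itertools

def a_star(start, is_goal, neighbors, heuristic):
    """Generic A* search returning a lowest-cost path from start to a goal.

    neighbors(state) yields (next_state, step_cost) pairs;
    heuristic(state) must never overestimate the true remaining cost
    (admissibility), which guarantees the returned path is optimal.
    """
    counter = itertools.count()  # tie-breaker so states are never compared
    # Frontier entries: (f = g + h, tie-breaker, g = cost so far, state, path)
    frontier = [(heuristic(start), next(counter), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt, cost in neighbors(state):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):  # found a cheaper route
                best_g[nxt] = ng
                heapq.heappush(
                    frontier,
                    (ng + heuristic(nxt), next(counter), ng, nxt, path + [nxt]),
                )
    return None  # no goal state is reachable
```

Unlike greedy left-to-right generation, the priority queue lets the search revisit and expand cheaper alternative states, which is the backtracking behavior the benchmark problems demand.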
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7467