Keywords: Mathematical & reasoning benchmark, Search & Combinatorial problems, A* algorithm
TL;DR: We introduce a novel benchmark of unique combinatorial problems that advanced LLMs fail to solve, and present a prompting and inference method using the A* algorithm that significantly improves LLMs' performance.
Abstract: Recently, Large Language Models (LLMs) have attained impressive performance on math and reasoning benchmarks. However, they still often struggle with multi-step reasoning that is relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique combinatorial problem types that avoid training contamination (each equipped with an automated pipeline to generate an arbitrary number of instances) and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text; e.g., GPT-4 and o1-preview solve only 1.4% and 18.6% correctly, respectively. SearchBench problems require considering multiple pathways to the solution and backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps only slightly. We next introduce an in-context learning approach that prompts the model to implement A*, an informed search algorithm, to comprehensively traverse the problem state space, improving the models' performance. We further extend this approach and propose the Multi-Stage-Multi-Try inference method, which breaks down the A* algorithm implementation into two stages and auto-verifies the first stage against unit tests, raising GPT-4's performance above 57%.
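For context, the A* search the abstract refers to can be sketched generically as follows. This is a minimal illustrative implementation, not the paper's prompted code; the function names (`a_star`, `neighbors`, `heuristic`) are placeholders chosen here for clarity.

```python
import heapq
import itertools

def a_star(start, is_goal, neighbors, heuristic):
    """Generic A* search returning a lowest-cost path from start to a goal.

    neighbors(state) yields (next_state, step_cost) pairs;
    heuristic(state) must never overestimate the true remaining cost
    (admissibility), which guarantees the returned path is optimal.
    """
    counter = itertools.count()  # tie-breaker so states are never compared
    # Frontier entries: (f = g + h, tie-breaker, g = cost so far, state, path)
    frontier = [(heuristic(start), next(counter), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for nxt, cost in neighbors(state):
            ng = g + cost
            if ng < best_g.get(nxt, float("inf")):  # found a cheaper route
                best_g[nxt] = ng
                heapq.heappush(
                    frontier,
                    (ng + heuristic(nxt), next(counter), ng, nxt, path + [nxt]),
                )
    return None  # no goal state is reachable
```

Unlike greedy left-to-right generation, the priority queue lets the search revisit and expand cheaper alternative states, which is the backtracking behavior the benchmark problems demand.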
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7467