Autonomous Tree-search Ability of Large Language Models

23 Sept 2023 (modified: 03 Feb 2024) | ICLR 2024 Conference Withdrawn Submission
Keywords: Large Language Model, Application, Reasoning, Tree Search
TL;DR: Large Language Models autonomously perform tree search to solve puzzles without the aid of external programs.
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities with advanced prompting techniques (e.g., Chain-of-Thought), but they fall short on tasks that require exploration, strategic foresight, and sequential decision-making. Recent works propose to utilize external programs (e.g., Python code) to define the search logic, so that LLMs can perform passive tree search to solve more challenging reasoning tasks. Though impressive results have been achieved, these approaches have several fundamental limitations. First, passive tree search is inefficient, as it usually requires multiple rounds of LLM API calls to solve a single problem. Moreover, passive search methods are not flexible, since they need task-specific program designs. A natural question then arises: can LLMs retain tree-search capability without the aid of external programs, while still generating responses that clearly demonstrate the process of a tree-structured search? To this end, we propose a new concept called the autonomous tree-search (ATS) ability of LLMs, in which the model automatically generates a single response containing the search trajectory that leads to the correct answer. Concretely, we first elicit both BFS- and DFS-style search trajectories from capable LLM APIs (e.g., GPT-4 and GPT-3.5) via a fixed system prompt, allowing them to perform autonomous tree search right out of the box. Experiments on four challenging puzzle games demonstrate that our method achieves substantial improvements. The ATS-BFS method outperforms the Chain-of-Thought approach with an average accuracy improvement of 33\%. Compared to Tree of Thoughts, it requires 65.6\% or 47.7\% less GPT API cost to attain a comparable level of accuracy. Moreover, we collect a dataset using the ATS prompting method and fine-tune LLaMA on it; this yields a greater improvement than fine-tuning on CoT data. Specifically, it outperforms CoT-tuned LLaMAs by an average of 40.6\% and 38.5\% for LLaMA2-7B and LLaMA2-13B, respectively.
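To make the contrast with passive tree search concrete, below is a minimal illustrative sketch (not taken from the paper): a single chat-completion call whose fixed system prompt asks the model to write out a BFS-style search trajectory in one response, instead of an external program driving many API calls. The prompt wording, function name, and example puzzle are assumptions for illustration only.

```python
# Minimal sketch, assuming the openai Python package (>=1.0) is installed and
# OPENAI_API_KEY is set. The system prompt below approximates the ATS-BFS idea
# described in the abstract; the paper's actual prompt may differ.
from openai import OpenAI

client = OpenAI()

# Hypothetical fixed system prompt eliciting a breadth-first search trajectory.
ATS_BFS_SYSTEM_PROMPT = (
    "Solve the puzzle by searching breadth-first. At each step, list several "
    "candidate next moves, briefly evaluate each one, prune candidates that "
    "cannot lead to a solution, and expand the rest level by level. Write out "
    "the full search tree before stating the final answer."
)

def solve_with_ats_bfs(puzzle: str, model: str = "gpt-4") -> str:
    """Return a single response containing the whole search trajectory."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ATS_BFS_SYSTEM_PROMPT},
            {"role": "user", "content": puzzle},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # Example puzzle chosen only for illustration.
    print(solve_with_ats_bfs("Use the numbers 4, 7, 8, 8 with +, -, *, / to reach 24."))
```

Because the search unfolds inside one generation, the cost is a single API call per problem, which is where the abstract's API-cost savings relative to passive tree search would come from under this reading.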
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7438