DIVERSITY OF THOUGHT IMPROVES REASONING ABILITIES OF LARGE LANGUAGE MODELS

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: diverse reasoning paths
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Diversity of thought improves the reasoning abilities of large language models
Abstract: Large language models (LLMs) are documented to struggle in settings that require complex reasoning. Nevertheless, instructing the model to break the problem into smaller reasoning steps (Wei et al., 2022b), or ensembling various generations through decoding alterations (Wang et al., 2023), boosts performance. Current approaches assume the input prompt is fixed and expect the decoding strategies to introduce the diversity needed for ensembling. In this work, we relax this assumption and discuss how one can create and leverage variations of the input prompt as a means of achieving diversity of thought to improve model performance. We propose a methodology to automatically improve prompt diversity by soliciting feedback from the LLM. In our new prompting approach, DIV-SE (DIVerse reasoning path Self-Ensemble), we use these diverse prompts as part of an ensemble across multiple inference calls. We also propose a cost-effective alternative where diverse prompts are used within a single inference call; we call this IDIV-SE (In-call DIVerse reasoning path Self-Ensemble). Under a fixed generation budget, DIV-SE and IDIV-SE generate more accurate results than the aforementioned baselines using both GPT-3.5 and GPT-4 on several reasoning benchmarks, without modifying the decoding process. Additionally, DIV-SE advances state-of-the-art performance on recent planning benchmarks (Valmeekam et al., 2022), exceeding the highest previously reported accuracy by at least 29.6 percentage points on the most challenging 4/5 blocks task in the Blocksworld problem. Our results shed light on how to enforce prompt diversity for LLM reasoning without increasing the generation budget.
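The core idea in the abstract, i.e. ensembling answers obtained from diverse prompt variants rather than from decoding randomness, can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function names (`div_se_ensemble`, `toy_llm`) and the prompt styles are hypothetical, and a real system would call an LLM API and extract answers more carefully.

```python
from collections import Counter
from typing import Callable, List

def div_se_ensemble(question: str,
                    prompt_styles: List[str],
                    llm: Callable[[str], str]) -> str:
    """DIV-SE-style ensemble (sketch): issue one inference call per
    diverse prompt variant, then take a majority vote over answers."""
    answers = [llm(f"{style}\n\nQuestion: {question}\nAnswer:")
               for style in prompt_styles]
    # Majority vote across the diverse reasoning paths.
    return Counter(answers).most_common(1)[0][0]

# Stand-in for a real LLM call; a deterministic toy function so the
# sketch runs without network access. One style yields a divergent answer.
def toy_llm(prompt: str) -> str:
    return "five" if "algebra" in prompt else "4"

styles = ["Think step by step.",
          "Solve this as an algebra problem.",
          "Explain your reasoning like a teacher would."]
print(div_se_ensemble("What is 2 + 2?", styles, toy_llm))  # -> 4
```

The in-call variant (IDIV-SE) would instead pack all prompt styles into a single request and ask the model to answer under each persona, trading some accuracy for a smaller generation budget.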
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6540