LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models
Keywords: reasoning, evaluation
TL;DR: We present a new evaluation method, a modular library, and an analysis of step-by-step reasoning with large language models
Abstract: Reasoning is a pivotal skill in the evolution of Large Language Models (LLMs), and constructing step-by-step reasoning chains has proven essential to enhancing it. This has produced a rich line of methods for deriving better reasoning chains from LLMs. However, two significant challenges remain unaddressed: the lack of effective methods for evaluating reasoning chains, and the absence of systematic analysis of existing reasoning algorithms. In this work, we introduce RICE, a new LLM-based approach to the automated evaluation of reasoning chains that autonomously constructs a detailed list of evaluation criteria for accurate and robust assessment. RICE significantly outperforms existing metrics and complements conventional evaluation based on final answers. To tackle the second challenge, we present a unified framework that formulates existing reasoning algorithms, leading to LLM Reasoners, a modular library designed to simplify the research and deployment of advanced reasoning algorithms by letting users specify problem domains and reasoning strategies with minimal effort. Through comprehensive experiments across a range of reasoning tasks, we analyze representative reasoning methods, highlighting the importance of reward-guided search, the impact of search breadth, and the benefits of a world model for LLM reasoning.
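To make the unified formulation concrete, the sketch below shows how a reward-guided search over reasoning steps can be decomposed into a step proposer (in practice, an LLM), a reward function that scores partial chains, and a generic search strategy. This is a minimal illustrative sketch under those assumptions, not the actual LLM Reasoners API; all names here (beam_search, propose, reward, is_terminal) are hypothetical.

# Minimal illustrative sketch (hypothetical names, not the actual LLM Reasoners API):
# a reasoning algorithm is decomposed into a step proposer, a reward function
# over partial reasoning chains, and a generic search strategy.

def beam_search(init_state, propose, reward, is_terminal, beam_width=4, max_depth=8):
    """Reward-guided beam search over step-by-step reasoning chains."""
    beam = [init_state]
    for _ in range(max_depth):
        candidates = []
        for state in beam:
            if is_terminal(state):
                candidates.append(state)           # carry finished chains forward
            else:
                for step in propose(state):        # e.g. an LLM proposing next steps
                    candidates.append(state + [step])
        # Reward-guided pruning: keep only the top-scoring chains. Search breadth
        # is controlled by beam_width, one of the factors analyzed in the paper.
        beam = sorted(candidates, key=reward, reverse=True)[:beam_width]
        if all(is_terminal(s) for s in beam):
            break
    return max(beam, key=reward)

if __name__ == "__main__":
    # Toy stand-in for an LLM: "reason" toward a target sum by appending +1/+2 steps.
    target = 5
    chain = beam_search(
        init_state=[],
        propose=lambda s: [1, 2],
        reward=lambda s: -abs(target - sum(s)),
        is_terminal=lambda s: sum(s) >= target,
        beam_width=2,
    )
    print(chain)  # e.g. [2, 2, 1]: a chain of steps reaching the target

In a realistic setting, the proposer and reward would query an LLM, and a world model could additionally predict the state reached after each step, matching the factors (reward-guided search, search breadth, world model) highlighted in the abstract.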
Submission Number: 49