Abstract: Large language models such as GPT-3 and PaLM have shown remarkable performance in few-shot learning. However, they still struggle with reasoning tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately guide the language model to generate a chain of reasoning steps before producing the final answer, successfully raising the problem-solving rate on the GSM8K benchmark from 17.9% to 58.1%. In this paper, we propose a new approach, DIVERSE (Diverse Verifier on Reasoning Step), to further advance their reasoning capability. First, DIVERSE explores different prompts to enhance the diversity of reasoning paths. Second, DIVERSE introduces a verifier to distinguish good answers from bad answers for better weighted voting. Finally, DIVERSE verifies the correctness of each individual step rather than all the steps as a whole. We conduct extensive experiments using the latest language model code-davinci-002 and demonstrate that DIVERSE achieves new state-of-the-art performance on six out of eight reasoning benchmarks (e.g., GSM8K 74.4% → 83.2%), outperforming the PaLM model with 540B parameters.
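For intuition only, the abstract's verifier-based weighted voting can be pictured as summing a verifier's correctness scores over sampled reasoning paths and picking the answer with the highest total. The sketch below is not from the paper; the function name `verifier_weighted_vote` and the example scores are illustrative assumptions.

```python
from collections import defaultdict

def verifier_weighted_vote(candidates):
    """Aggregate sampled reasoning paths by summing verifier scores per final answer.

    `candidates` is a list of (final_answer, verifier_score) pairs, where each
    score is assumed to be the verifier's estimated probability that the
    corresponding reasoning path is correct.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    # The answer with the highest accumulated verifier score wins the vote.
    return max(totals, key=totals.get)

# Hypothetical example: three sampled paths reach "18", one reaches "20".
sampled = [("18", 0.91), ("18", 0.64), ("20", 0.72), ("18", 0.55)]
print(verifier_weighted_vote(sampled))  # -> "18"
```

Compared with plain majority voting, this weighting lets a few high-confidence paths outweigh many low-confidence ones; the paper's step-level verification would further refine how those scores are produced.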