Abstract: Large language models such as GPT-3 and PaLM have shown remarkable performance in few-shot learning. However, they still struggle with reasoning tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately guide the language model to generate a chain of reasoning steps before producing the final answer, successfully raising the problem-solving rate on the GSM8K benchmark from 17.9% to 58.1%. In this paper, we propose a new approach, DIVERSE (Diverse Verifier on Reasoning Step), to further advance their reasoning capability. First, DIVERSE explores different prompts to enhance the diversity of reasoning paths. Second, DIVERSE introduces a verifier to distinguish good answers from bad answers for better weighted voting. Finally, DIVERSE verifies the correctness of each individual step rather than all the steps as a whole. We conduct extensive experiments using the latest language model code-davinci-002 and demonstrate that DIVERSE achieves new state-of-the-art performance on six out of eight reasoning benchmarks (e.g., GSM8K 74.4% → 83.2%), outperforming the PaLM model with 540B parameters.
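For intuition only, the abstract's verifier-based weighted voting can be pictured as summing a verifier's correctness scores over sampled reasoning paths and picking the answer with the highest total. The sketch below is not from the paper; the function name `verifier_weighted_vote` and the example scores are illustrative assumptions.

```python
from collections import defaultdict

def verifier_weighted_vote(candidates):
    """Aggregate sampled reasoning paths by summing verifier scores per final answer.

    `candidates` is a list of (final_answer, verifier_score) pairs, where each
    score is assumed to be the verifier's estimated probability that the
    corresponding reasoning path is correct.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    # The answer with the highest accumulated verifier score wins the vote.
    return max(totals, key=totals.get)

# Hypothetical example: three sampled paths reach "18", one reaches "20".
sampled = [("18", 0.91), ("18", 0.64), ("20", 0.72), ("18", 0.55)]
print(verifier_weighted_vote(sampled))  # -> "18"
```

Compared with plain majority voting, this weighting lets a few high-confidence paths outweigh many low-confidence ones; the paper's step-level verification would further refine how those scores are produced.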