Keywords: Large Language Models; Chain-of-thought; Complex Reasoning; Evaluation
TL;DR: We curate a suite of evaluation datasets to track the reasoning capabilities of large language models.
Abstract: As large language models (LLMs) are continuously being developed, their evaluation becomes increasingly important yet challenging.
This work proposes Chain-of-Thought Hub, an open-source evaluation suite for the multi-step reasoning capabilities of large language models.
We are interested in this setting for two reasons:
(1) from the behavior of the GPT and PaLM model families,
we observe that complex reasoning is likely to be a key differentiator between weaker and stronger LLMs;
(2) we envisage large language models becoming the next-generation computational platform and fostering an ecosystem of new LLM-based applications; this naturally requires the foundation models to perform complex tasks that often involve the composition of linguistic and logical operations.
Our approach is to compile a suite of challenging reasoning benchmarks to track the progress of LLMs.
Our current results show that:
(1) model scale clearly correlates with reasoning capabilities;
(2) as of May 2023, Claude-v1.3 and PaLM-2 are the only two models comparable to GPT-4, while open-source models still lag behind;
(3) LLaMA-65B performs close to code-davinci-002, indicating that with successful further development such as reinforcement learning from human feedback (RLHF), it has great potential to approach GPT-3.5-Turbo.
Our results also suggest that, for open-source efforts to catch up, the community should focus more on building better base models and exploring RLHF.
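For concreteness, the sketch below illustrates the standard few-shot chain-of-thought evaluation recipe that a benchmark suite of this kind typically implements: prompt the model with worked examples, let it generate a free-form rationale, and score only the final answer against the gold label. This is a minimal illustration under assumptions, not the actual Chain-of-Thought Hub code; `model_generate`, `extract_answer`, and the toy data are hypothetical placeholders.

```python
import re

# One worked example in the prompt; real evaluations typically use several.
FEW_SHOT_PROMPT = (
    "Q: Tom has 3 apples and buys 2 more. How many apples does he have?\n"
    "A: Tom starts with 3 apples and buys 2 more, so 3 + 2 = 5. The answer is 5.\n\n"
)

def model_generate(prompt: str) -> str:
    # Hypothetical stand-in for an LLM completion API (e.g., GPT-4, Claude, LLaMA).
    return "Ann starts with 4 cookies and bakes 4 more, so 4 + 4 = 8. The answer is 8."

def extract_answer(completion: str) -> str:
    # Common CoT scoring heuristic: take the last number in the generated rationale.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else ""

def evaluate(dataset: list[dict]) -> float:
    # Few-shot CoT accuracy: the model writes out its reasoning, and only the
    # final numeric answer is compared against the gold label.
    correct = 0
    for item in dataset:
        prompt = FEW_SHOT_PROMPT + "Q: " + item["question"] + "\nA:"
        correct += extract_answer(model_generate(prompt)) == item["answer"]
    return correct / len(dataset)

if __name__ == "__main__":
    toy_data = [{"question": "Ann has 4 cookies and bakes 4 more. How many cookies does she have?",
                 "answer": "8"}]
    print(f"accuracy = {evaluate(toy_data):.2f}")
```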
Submission Number: 16