Keywords: LLM Reasoning, Mathematical Reasoning, Reasoning Behavior, Evaluating Reasoning, Compositional Generalization
TL;DR: We study the reasoning gap of frontier models by evaluating them on compositional pairs of existing problems.
Abstract: We study the depth of problem-solving capabilities of LLMs, and to what extent they perform mathematical reasoning in a compositional manner. To this end, we create a new benchmark by composing pairs of existing math word problems so that the answer to the second problem depends on correctly answering the first. We define a model's reasoning gap as the difference between its performance when solving each question independently and its performance on the compositional pairs. Our findings reveal a significant reasoning gap in most frontier LLMs; the gap is more pronounced in smaller and more cost-efficient models. The objective of this study is not to introduce yet another benchmark, but rather to provide a case study aimed at gaining deeper insight into current models' reasoning abilities, and to reassess established training methods and benchmarks.
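To make the metric concrete, here is a minimal sketch (not from the paper) of how such a reasoning gap could be computed, assuming hypothetical per-item correctness lists for the standalone problems and for the compositional pairs; the function name and inputs are illustrative only.

```python
# Sketch of a "reasoning gap" computation under assumed correctness records.
from statistics import mean

def reasoning_gap(independent_correct: list[bool], compositional_correct: list[bool]) -> float:
    """Difference between accuracy on problems answered independently and
    accuracy on compositional pairs (where the second answer depends on the first)."""
    independent_acc = mean(independent_correct)      # accuracy on standalone problems
    compositional_acc = mean(compositional_correct)  # accuracy on chained pairs
    return independent_acc - compositional_acc

# Example with made-up numbers: 90% standalone accuracy vs. 70% on composed pairs -> gap of 0.2
print(reasoning_gap([True] * 9 + [False], [True] * 7 + [False] * 3))
```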
Concurrent Submissions: Other workshops at NeurIPS
Submission Number: 82