ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Published: 01 Mar 2026, Last Modified: 01 Mar 2026 | P-AGI | CC BY 4.0
Track: Track 1: Technical Foundations for a Post-AGI World
Keywords: large language models, reasoning, benchmark, evaluation, reproducibility, variability, uncertainty quantification
TL;DR: ReasonBENCH is an open-source benchmark for variance- and cost-aware evaluation of LLM reasoning.
Abstract: Large language model (LLM) reasoning is typically evaluated using single runs, masking how much performance can vary across repeated executions. This practice obscures both reliability and cost, and can lead to misleading comparisons between reasoning methods and models. We introduce ReasonBENCH, a benchmark suite and open-source library for controlled multi-run evaluation of LLM reasoning. For each model–strategy–task configuration, we perform repeated trials across 6 diverse benchmarks and report variance-aware metrics for both quality and cost, including confidence intervals and run-to-run variability measures. Using standardized implementations, we benchmark 10 widely used reasoning strategies under identical model conditions and evaluate 10 contemporary reasoning-oriented LLMs in a zero-shot setting. Our results show that run-to-run variability is substantial, benchmark-dependent, and often large enough to change model/method rankings relative to single-run averages. Additional analyses reveal that scaling within a model family improves both average quality and stability, while increasing test-time reasoning effort primarily increases cost without yielding statistically significant quality gains. Together, these findings motivate distribution-aware evaluation practices and provide reproducible tooling to support more reliable progress in LLM reasoning research. ReasonBENCH is publicly available at [https://anonymous.4open.science/r/ReasonBench-64B3](https://anonymous.4open.science/r/ReasonBench-64B3).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Nearchos_Potamitis1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 10
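
The variance-aware reporting described in the abstract (repeated trials per configuration, summarized with confidence intervals and run-to-run variability) could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the ReasonBENCH API: the function name `summarize_runs`, its parameters, the percentile-bootstrap choice, and the accuracy values are all hypothetical assumptions made for the example.

```python
import random
import statistics


def summarize_runs(scores, n_boot=2000, alpha=0.05, seed=0):
    """Summarize scores from repeated runs of one configuration.

    Hypothetical helper (not part of ReasonBENCH): returns the mean,
    the sample standard deviation, the min/max, and a percentile
    bootstrap confidence interval for the mean.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")

    rng = random.Random(seed)  # fixed seed so the summary is reproducible
    n = len(scores)

    # Resample the runs with replacement and record each resample's mean.
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]

    return {
        "mean": statistics.fmean(scores),
        "std": statistics.stdev(scores),  # run-to-run variability
        "ci": (lo, hi),
        "min": min(scores),
        "max": max(scores),
    }


# Made-up accuracies from five repeated runs of two hypothetical methods.
runs_a = [0.62, 0.71, 0.58, 0.69, 0.65]
runs_b = [0.66, 0.64, 0.67, 0.65, 0.66]

for name, runs in [("A", runs_a), ("B", runs_b)]:
    s = summarize_runs(runs)
    print(
        f"method {name}: mean={s['mean']:.3f} std={s['std']:.3f} "
        f"95% CI=({s['ci'][0]:.3f}, {s['ci'][1]:.3f})"
    )
```

In this made-up data, a single run of method A could land above method B's best run or below its worst, so a one-shot comparison could favor either method; reporting the mean together with the spread makes that uncertainty visible.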