Keywords: LLMs, Reasoning Models, Reasoning Efficiency, CogniLoad Benchmark
TL;DR: We introduce a measure of reasoning efficiency for SotA reasoning LLMs, which we decompose to compare the LLMs' unique strengths and weaknesses.
Abstract: While large language models (LLMs) are typically evaluated on accuracy metrics alone, real-world deployment requires careful control of computational efficiency. Building on the CogniLoad benchmark, we introduce a unified efficiency metric (i.e., correct answers per 1,000 output tokens) enabling direct cross-model comparison. We further provide an interpretable factorization into context robustness, logic robustness, and token appetite. Evaluating 15 state-of-the-art reasoning LLMs on CogniLoad, we find that some models fail due to logic errors, others consume excessive tokens, and a few hit context limits. Tokens are an imperfect but practical proxy for computational load, permitting consistent comparisons across closed and open models. By decomposing overall efficiency into actionable components, our framework identifies concrete targets for improving LLM reasoning efficiency.
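For illustration only, a minimal sketch of how the headline metric (correct answers per 1,000 output tokens) could be computed from per-question correctness flags and output-token counts; the function name and interface are assumptions for this example, and the factorization into context robustness, logic robustness, and token appetite follows the paper's definitions, which are not reproduced here.

```python
# Minimal sketch (assumed interface): compute the headline efficiency metric,
# correct answers per 1,000 output tokens, from per-question results.

def efficiency_per_1k_tokens(correct_flags, output_token_counts):
    """Correct answers per 1,000 output tokens across a set of questions.

    correct_flags: iterable of bools (True if the model answered correctly)
    output_token_counts: iterable of ints (output tokens spent per question)
    """
    total_correct = sum(bool(c) for c in correct_flags)
    total_tokens = sum(output_token_counts)
    if total_tokens == 0:
        return 0.0
    return 1000.0 * total_correct / total_tokens


# Hypothetical usage with made-up numbers (not results from the paper):
flags = [True, False, True, True]
tokens = [1200, 3400, 900, 2500]
print(efficiency_per_1k_tokens(flags, tokens))  # 0.375 correct answers per 1k tokens
```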
Submission Number: 66