Keywords: LLM; Reasoning; Benchmark
Abstract: Large language models (LLMs) such as GPT-4, Claude 3, and the Gemini series have pushed the frontier of automated reasoning and code generation. Yet prevailing benchmarks emphasize accuracy and output quality, neglecting a critical dimension: decoding token efficiency. In real systems, the difference between generating 10K and 100K tokens is nontrivial in latency, cost, and energy. In this work, we introduce OckBench, the first model-agnostic, hardware-agnostic benchmark that jointly measures accuracy and decoding token count on reasoning and coding tasks. Through experiments comparing multiple open- and closed-source models, we find that many models with comparable accuracy differ widely in token consumption, revealing efficiency variance as a neglected but significant axis of differentiation. We further chart Pareto frontiers over the accuracy–efficiency plane and argue for an evaluation paradigm shift: tokens should no longer be treated as “free” to multiply. OckBench provides a unified platform for measuring, comparing, and guiding research in token-efficient reasoning.
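As a minimal sketch of the kind of accuracy–efficiency analysis the abstract describes, the snippet below computes which models lie on the Pareto frontier when accuracy should be maximized and decoding token count minimized. The function name, the per-model metric format, and the numbers are illustrative assumptions, not OckBench's actual API or results.

```python
def pareto_frontier(models):
    """Return models that are Pareto-optimal on (accuracy up, avg tokens down).

    `models` maps a model name to an (accuracy, avg_decode_tokens) tuple.
    Both the format and the values are hypothetical, for illustration only.
    """
    frontier = []
    for name, (acc, toks) in models.items():
        # A model is dominated if some other model is at least as accurate
        # and at least as cheap, and strictly better on one of the two axes.
        dominated = any(
            o_acc >= acc and o_toks <= toks and (o_acc > acc or o_toks < toks)
            for o_name, (o_acc, o_toks) in models.items()
            if o_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical numbers purely for illustration.
scores = {
    "model_a": (0.82, 45_000),
    "model_b": (0.81, 12_000),  # near-equal accuracy at ~4x fewer tokens
    "model_c": (0.70, 60_000),  # dominated: worse on both axes
}
print(pareto_frontier(scores))  # -> ['model_a', 'model_b']
```

Under this view, model_a and model_b both sit on the frontier despite very different token budgets, which is exactly the kind of differentiation an accuracy-only leaderboard would hide.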
Submission Number: 303