Unified Deployment-Aware Evaluation of Open Reasoning Language Models

TMLR Paper9046 Authors

19 May 2026 (modified: 31 May 2026) | Under review for TMLR | CC BY 4.0

Abstract: Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes the results hard to interpret for practical model selection. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks (ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1) under three prompting strategies: zero-shot, chain-of-thought (CoT), and few-shot CoT. Every model-dataset-strategy condition is evaluated on the same 238-example subset, which yields a complete 7 × 4 × 3 design with 84 conditions and 19,992 evaluated examples. In addition to accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Under this unified protocol, the highest weighted score, 0.794, is achieved by Gemma-4-26B-A4B with zero-shot prompting, while Gemma-4-E4B remains close to the top across prompting settings with substantially lower latency and memory, making it a particularly attractive practical operating point. Bootstrap and paired-permutation analyses show that the top weighted configurations are close enough that deployment tradeoffs remain important. We further find that prompting strategy changes the ranking order rather than simply shifting all models in the same direction, and that benchmark-specific complementarity creates measurable routing headroom: an oracle task-aware selector reaches a weighted score of 0.825. Finally, compatibility diagnostics reveal that some apparent failures, especially for Phi-4-Reasoning on GSM8K, reflect deployment-relevant robustness and interface-adherence problems under the shared evaluation pipeline. These results support a central claim: open-model evaluation should be framed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.
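
The design arithmetic and the Wilson intervals mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `wilson_interval`, the 95% level (`z = 1.96`), and the example count of 119 correct answers are assumptions chosen for illustration only. The design sizes (7, 4, 3, and 238) are taken from the abstract.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an assumed 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Design size as stated in the abstract: 7 models x 4 benchmarks x 3 strategies,
# each condition evaluated on the same 238-example subset.
n_conditions = 7 * 4 * 3
n_examples = n_conditions * 238

# Hypothetical condition with 119 of 238 answers correct (illustrative only).
low, high = wilson_interval(119, 238)
print(n_conditions, n_examples)
print(f"accuracy 0.500, Wilson interval [{low:.3f}, {high:.3f}]")
```

With only 238 examples per condition, the interval around a single accuracy figure is wide enough that small gaps between configurations should be read with care, which is consistent with the abstract's remark that the top configurations are close.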

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Li_Erran_Li1

Submission Number: 9046
