How Far Can SLMs Go Without ‘Thinking’ in the LLM-as-a-Judge Paradigm?

Published: 16 Oct 2025 · Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: efficiency, reasoning, LLM-as-a-Judge, SLMs
TL;DR: Thinking Small Models are Efficient LLM Judges
Abstract: As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of “thinking” and “non-thinking” LLMs in the LLM-as-a-Judge paradigm using open-source Qwen-3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that, despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Moreover, thinking models achieve approximately 10 percentage points higher accuracy with little relative overhead (under 2x), whereas augmentation strategies such as few-shot learning deliver only modest gains at a much higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models remain significantly more consistent under a variety of bias conditions, including positional, bandwagon, identity, diversity, and random biases ($\sim6\%$ higher on average). We also extend our experiments to the multilingual setting, where our results confirm that the benefits of explicit reasoning carry over beyond English. Overall, our findings show that even when leveraging significantly more compute, non-thinking models fail to match the performance and robustness of their thinking counterparts, making explicit reasoning the more efficient and reliable choice for the LLM-as-a-Judge paradigm. Through this work, we advocate investing in models with innate, low-cost reasoning capabilities rather than relying on post-hoc augmentation techniques.
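To make the comparison concrete, the sketch below shows one plausible way to run a small Qwen-3 model as a pairwise judge while toggling the chat template's `enable_thinking` flag between "thinking" and "non-thinking" modes. This is a minimal illustration, not the authors' evaluation harness: the checkpoint name, prompt wording, generation lengths, and verdict parsing are assumptions made for the example.

```python
# Minimal sketch (assumed setup, not the paper's exact harness): pairwise
# LLM-as-a-Judge with a small Qwen-3 model, switching the chat template's
# `enable_thinking` flag to compare thinking vs. non-thinking judging.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-0.6B"  # assumed checkpoint; the paper uses 0.6B/1.7B/4B sizes
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype="auto", device_map="auto"
)

def judge(question: str, answer_a: str, answer_b: str, thinking: bool) -> str:
    """Ask the model which answer is better; return 'A', 'B', or 'unparsed'."""
    prompt = (
        "You are an impartial judge. Given the question and two candidate answers, "
        "decide which answer is better. Reply with exactly 'A' or 'B' on the last line.\n\n"
        f"Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,  # Qwen-3 chat-template switch for thinking mode
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    # Thinking mode needs a larger token budget for the reasoning trace.
    output_ids = model.generate(**inputs, max_new_tokens=1024 if thinking else 16)
    completion = tokenizer.decode(
        output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    ).strip()
    verdict = completion.splitlines()[-1].strip() if completion else ""
    return verdict if verdict in {"A", "B"} else "unparsed"
```

The relative compute of the two modes can then be estimated from token counts, for instance with the common approximation of roughly 2 × parameters FLOPs per generated token, though the paper's exact FLOPs accounting may differ.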
Submission Number: 262