Is Your Judge Truly Reasoning? Evaluating and Enhancing LLM-as-a-Judge with Simple Test-Time Scaling

ACL ARR 2026 January Submission 7781 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM-as-a-Judge, Reasoning Model, Test-Time Scaling
Abstract: While test-time scaling has revolutionized reasoning models, its application to automated evaluation (LLM-as-a-Judge) remains underexplored. We identify that existing judges, even those fine-tuned on reasoning data, fail to benefit from Simple Test-Time Scaling (STTS) and often degrade into verbose hallucinations. To address this, we introduce $\textbf{J1-7B}$, an LLM-as-a-Judge trained via a two-stage paradigm: Supervised Fine-Tuning on reflection-enhanced datasets followed by Reinforcement Learning with Verifiable Rewards (RLVR). Empirically, $\textbf{J1-7B}$ surpasses state-of-the-art open-source judges by around $\textbf{5.0}\%$ and exhibits a further $\textbf{4.1}\%$ relative performance gain under STTS. Crucially, our analysis reveals that effective scaling behavior emerges primarily during the RL phase. Furthermore, we provide an information-theoretic analysis demonstrating that $\textbf{J1-7B}$'s extended reasoning facilitates genuine semantic refinement, effectively distinguishing genuine reliability from artificial certainty and mode collapse. Finally, we show that extended reasoning achieves superior calibration, rendering the model amenable to rigorous risk-controlled selective prediction.
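The abstract does not spell out the STTS mechanism. Below is a minimal sketch of the common "budget forcing" variant of simple test-time scaling (appending a continuation cue such as "Wait" whenever the judge stops reasoning early), assuming a Hugging Face checkpoint; the model path, cue string, token budget, and function name are illustrative assumptions, not the paper's specification.

```python
# Sketch of Simple Test-Time Scaling (STTS) via "budget forcing":
# if the judge finishes before a target thinking budget, append a
# continuation cue and resume decoding. All names/values below are
# illustrative assumptions, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "path/to/J1-7B"  # hypothetical checkpoint location
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def judge_with_stts(prompt: str, think_budget: int = 2048, max_extensions: int = 2) -> str:
    """Decode a judgment, forcing additional reflection up to `max_extensions` times."""
    text = prompt
    for _ in range(max_extensions + 1):
        ids = tok(text, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=think_budget, do_sample=False)
        new_tokens = out.shape[-1] - ids["input_ids"].shape[-1]
        text = tok.decode(out[0], skip_special_tokens=True)
        # Stop extending once the reasoning segment has consumed the budget.
        if new_tokens >= think_budget:
            break
        text += "\nWait,"  # continuation cue that prompts the judge to re-examine its verdict
    return text
```

In this sketch the cue simply re-opens the reasoning trace; whether extended reasoning actually improves the verdict is exactly the property the paper evaluates.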
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: Generation, Language Modeling, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English
Submission Number: 7781