The Necessity of Setting Temperature in LLM-as-a-Judge

ACL ARR 2026 May Submission14405 Authors

26 May 2026 (modified: 18 Jun 2026), ACL ARR 2026 May Submission, CC BY 4.0
Keywords: LLM-as-a-Judge, LLM Evaluation, Decoding Temperature
Abstract: Using large language models (LLMs) as judges for evaluating model outputs has emerged as an important paradigm for automated evaluation. However, the decoding temperature in LLM-as-a-Judge settings is still largely chosen empirically, with limited systematic evidence on its impact. To address this gap, we conduct a systematic study of how temperature affects judgment behavior across different LLM judge models, prompting strategies, and evaluation paradigms. Our results show that higher temperatures generally decrease judgment consistency and increase formatting errors, while also exposing latent uncertainty that tends to remain suppressed under low-temperature decoding, particularly in ambiguous cases. Further analysis suggests that higher temperatures can serve as an exploratory mechanism and may improve judging performance in complex or uncertain evaluation scenarios. Overall, low-temperature settings are better suited to tasks that prioritize stability and reproducibility, whereas higher-temperature settings are more appropriate for scenarios involving substantial ambiguity or complexity, where exploring the judge’s decision space is beneficial. These findings suggest that, in LLM-as-a-Judge systems, temperature should be treated not as a fixed hyperparameter but as a controllable, task-dependent design choice that mediates the trade-off between reliability and exploration.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Automatic evaluation, Reproducibility, Statistical testing for evaluation, Prompting analysis
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 14405
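
Illustrative note on the abstract: the claimed trade-off between consistency and exploration can be pictured with standard temperature-scaled softmax sampling. The sketch below is not the authors' experimental setup. The verdict labels, the logit values in `toy_logits`, and the helper names `softmax_with_temperature` and `agreement_rate` are all assumptions made up for illustration. It only shows that, for a fixed set of verdict logits, a lower temperature concentrates probability on one verdict (higher agreement across repeated samples), while a higher temperature spreads it out.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing each logit by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def agreement_rate(logits, temperature, n_samples=1000, seed=0):
    """Fraction of sampled verdicts that match the most frequent sampled verdict."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    verdicts = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    return Counter(verdicts).most_common(1)[0][1] / n_samples


# Hypothetical logits for three verdicts ("A wins", "B wins", "tie").
# These values are invented for illustration, not taken from the paper.
toy_logits = [2.0, 1.0, 0.0]

for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs], agreement_rate(toy_logits, t))
```

In this toy setting, the agreement rate stands in for judgment consistency; a real study would sample full judge responses from a model rather than a single categorical verdict.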