On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization

Published: 26 Jan 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large Language Models, LLM-as-a-Judge, Distribution Shift, Generalization, Evaluation Robustness
TL;DR: We study the generalization and robustness of LLM judges as new LLMs emerge, addressing practical questions about judge shelf life, including future-proofing, backward-compatibility, and question generalization.
Abstract: The LLM-as-a-judge paradigm is widely used both for evaluating free-text model responses and for reward modeling in model alignment and fine-tuning. Recently, fine-tuning judges on judge-specific data has emerged as an often-preferred alternative to directly prompting frontier models as judges, since fine-tuned judges achieve better performance at smaller model sizes while being more robust to common biases. However, standard evaluation ignores several practical concerns about the real-world deployment of fine-tuned judges. In this paper, we identify and formalize three aspects that affect the *shelf life* of these judges: *future-proofing* and *backward-compatibility*, i.e., how well judges fine-tuned on responses from today's generator models perform on responses from future or past models, and *question generalization*, i.e., how well judges generalize to unseen questions at test time. We study these three aspects under a unified framework with varying train and test distributions, using two reasoning datasets, three SFT- and DPO-based fine-tuning algorithms, and three different backbone models. Experiments suggest that future-proofing is challenging for most models, while backward-compatibility is relatively easy, with DPO-trained models consistently *improving* performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models exhibit some degree of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18969