CompassJudger-2: A Holistic Approach Towards Generalist Judge Model

19 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: LLM, LLM-as-a-Judge, Reward Model
Abstract: Recently, the role of LLM-as-a-judge in evaluating large language models has gained prominence, emerging as an important method to partially replace costly human assessment. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we decompose the abilities of a generalist, generative judge model into three levels: objective verification, subjective evaluation, and rubric refinement. We present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-scenario data curation strategy. We conduct large-scale data collection for each type of task and design tailored rejection sampling strategies to filter the data, ensuring data diversity, accuracy, and effectiveness. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, demonstrating its robustness and generalization ability. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18738