Abstract: LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values. Recent studies have proposed many methods for training LLMs as generative judges, but most are data-intensive or lack accuracy, and they focus only on the model's judging ability. In this work, we conceptualize judging ability as a general capability of LLMs and adapt the two-stage SFT-DPO training framework, commonly used in general-purpose model training, to the development of judge models. We introduce an efficient data synthesis method that includes the automatic generation of diverse judge templates and dual verification for data accuracy and consistency. A difficulty-based data stratification strategy allows us to route the most effective data to the SFT and DPO stages respectively. Experimental results demonstrate that our approach achieves SOTA performance on RewardBench while using only about 2% to 40% of the data required by other methods. Furthermore, our training method enhances the general capabilities of the model by constructing complex judge tasks with chain-of-thought (CoT) outputs. We further validate the effectiveness of our model by deploying it to provide reward signals in a real-world RLHF scenario. We will open-source our model weights and training data to facilitate further research.
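The abstract mentions routing synthesized judge data to the SFT and DPO stages by difficulty. The snippet below is a minimal illustrative sketch of one way such stratification could work, not the authors' implementation: the `JudgeExample` fields, the `pass_rate` agreement score, and the 0.7 threshold are all assumptions introduced here for illustration.

```python
# Hypothetical sketch of difficulty-based data stratification (not the paper's code).
# Assumes each synthesized judge example carries a pass_rate: the fraction of sampled
# judgments that agree with the verified preference label from dual verification.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class JudgeExample:
    prompt: str             # judge template filled with the responses to compare
    chosen_judgment: str    # CoT judgment consistent with the verified label
    rejected_judgment: str  # CoT judgment contradicting the verified label
    pass_rate: float        # agreement rate of sampled judgments, in [0, 1]

def stratify(examples: List[JudgeExample],
             sft_threshold: float = 0.7) -> Tuple[List[Dict], List[Dict]]:
    """Route easier examples to SFT imitation data and harder ones to DPO pairs."""
    sft_data, dpo_data = [], []
    for ex in examples:
        if ex.pass_rate >= sft_threshold:
            # Easy, consistent cases: learn the verified CoT judgment directly.
            sft_data.append({"prompt": ex.prompt, "response": ex.chosen_judgment})
        else:
            # Hard cases: keep both judgments as a chosen/rejected preference pair.
            dpo_data.append({"prompt": ex.prompt,
                             "chosen": ex.chosen_judgment,
                             "rejected": ex.rejected_judgment})
    return sft_data, dpo_data
```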
Paper Type: Long
Research Area: Generation
Research Area Keywords: automatic evaluation, text-to-text generation
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English, Chinese
Keywords: automatic evaluation, text-to-text generation, LLM-as-a-Judge, SFT, DPO
Submission Number: 3471