Keywords: LLM Alignment; LLM-as-a-Judge; Alignment Evaluation; Preference Optimization
Abstract: The evaluation of LLM alignment is typically conducted in a reference-free manner, without relying on reference outputs. This prevents the direct adaptation of recent LLM training methods based on verifiable metrics or rewards, which require ground-truth outputs.
In this work, we investigate whether reference outputs can be effectively leveraged to improve LLM alignment. To this end, we first design evaluation methods that enhance LLM-based evaluators with high-quality reference outputs. Through comprehensive experiments, we show that reference-guided evaluation substantially improves the performance of less capable LLM evaluators when references are generated by frontier LLMs, and that strong LLM evaluators can be further enhanced with human-written references. We then demonstrate the utility of high-quality references in alignment tuning, where reference-guided LLMs are used as judges to drive self-improvement. The results show that reference-guided LLMs-as-judges yield clear gains over reference-free baselines in this semi-self-improvement setting, and achieve performance comparable to training with finetuned reward models. In particular, reference-guided self-improvement achieves scores of 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B. These results highlight the strong potential of leveraging references for LLM training in non-verifiable domains using reference-guided LLM-based evaluators.
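For concreteness, the sketch below illustrates the general idea of a reference-guided pairwise judge: the evaluator is shown a high-quality reference answer alongside the two candidate responses it must compare. The prompt template and the `call_llm` placeholder are illustrative assumptions, not the paper's actual prompt or implementation.

```python
# Minimal sketch of a reference-guided LLM-as-a-Judge (assumed template,
# not the authors' exact prompt). `call_llm` is a placeholder for the
# chat-completion client or local model actually used as the judge.

JUDGE_TEMPLATE = """You are an impartial judge. A high-quality reference
answer is provided to help you assess the two candidate responses.

[Instruction]
{instruction}

[Reference Answer]
{reference}

[Response A]
{response_a}

[Response B]
{response_b}

Compare the responses against the reference and the instruction.
Output exactly one token: "A" if Response A is better, "B" otherwise."""


def call_llm(prompt: str) -> str:
    """Placeholder for the judge model call (API or local inference)."""
    raise NotImplementedError


def reference_guided_judge(instruction: str, reference: str,
                           response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' according to the reference-guided judge."""
    prompt = JUDGE_TEMPLATE.format(
        instruction=instruction,
        reference=reference,
        response_a=response_a,
        response_b=response_b,
    )
    verdict = call_llm(prompt).strip()
    return "A" if verdict.startswith("A") else "B"
```

In the self-improvement setting described above, such a judge could be applied to pairs of the policy model's own samples to produce preference labels for preference optimization; the reference-free baseline corresponds to the same setup with the reference field omitted.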
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21068