Keywords: Text Evaluation, LLM-as-a-Judge
Abstract: LLM-as-a-Judge has emerged as a popular alternative to traditional lexical and embedding-based evaluation metrics, offering improved correlation with human judgments. However, methods relying on heuristic prompts often suffer from misalignment with human judgments. While recent approaches have incorporated optimization strategies (e.g., prompt iteration), they often lack a mechanism for dynamically evolving evaluation perspectives in response to prediction misalignment.
To address this limitation, we propose a misalignment-driven evolutionary evaluator (MAD-Eval) that treats evaluation alignment as an optimization process. MAD-Eval consists of three components: error-driven perspective evolution to refine evaluation perspectives, instance-aware expert routing to select perspectives tailored to each instruction, and adaptive aggregation to fuse perspective-level scores into a final judgment aligned with human evaluations.
In MAD-Eval, misalignment serves as a unified feedback signal driving evolution across all stages: perspective evolution, expert routing, and aggregation.
Experiments demonstrate that MAD-Eval consistently outperforms state-of-the-art baselines in consistency with human judgments and transferability across different datasets.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Language Modeling, NLP Applications
Languages Studied: English
Submission Number: 5575