D-MOE-EVAL: A Dynamic Mixture Of Experts Framework For Human-Aligned Nuanced Large Language Model Evaluation

ICLR 2026 Conference Submission 25569 Authors

20 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Large Language Models, Fine Grained Evaluation, Multi-Dimensional Evaluation, Mixture of Experts, Scenario Aware Evaluation
TL;DR: This paper proposes a scenario-aware, multi-dimensional LLM evaluation framework built on a Mixture-of-Experts approach: inputs across multiple domains are routed to profiled dimension-specific experts, whose judgments are reconciled by a deliberating Panel of Judges to ensure human-aligned, nuanced evaluation.
Abstract: The growing paradigm of using Large Language Models (LLMs) as evaluators, known as LLM-as-a-Judge, offers significant scalability for automated assessment. However, this approach suffers from certain limitations. The differing architectures and training of LLMs lead them to develop varied expertise, making any single monolithic agent prone to bias and limited in adaptability across different reasoning scenarios. This inherent bottleneck leads to measurement imbalance across evaluation criteria and an over-prioritization of narrow technical correctness at the expense of diverse human-centered dimensions. To address these challenges, this paper presents a scenario-aware, multi-dimensional evaluation framework that operationalizes a Mixture-of-Experts (MoE) architecture. The framework features instance-level scenario classification, dynamically mapping inputs to the most appropriate evaluation context, with each scenario linked to its own tailored set of evaluation dimensions. The dimension experts are specialized LLMs, dynamically selected after validation on a multi-dimensional dataset to systematically profile and identify their strengths across specified dimensions. This adaptive routing ensures that each instance receives a contextually relevant assessment across multiple complementary dimensions simultaneously. The expert evaluations are synthesized by a "Panel of Judges" deliberation layer, in which multiple agents engage in structured debate to reconcile discrepancies and ensure fairness and logical consistency in the final judgments. The results of this study, evaluated on the MDEval and LLMBar benchmarks, demonstrate the proposed framework's superior performance over existing baselines across diverse tasks, showcasing the robustness, versatility, and generalizability of a Mixture-of-Experts approach for context-aware LLM evaluation.
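To make the described pipeline concrete, below is a minimal sketch in Python of the evaluation flow outlined in the abstract: instance-level scenario classification, routing to dimension-specific expert judges, and a panel-style deliberation step. All names here (call_llm, SCENARIO_DIMENSIONS, DIMENSION_EXPERTS, the judge identifiers, and the prompts) are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of a scenario-aware MoE evaluation pipeline (assumed interfaces).
from statistics import mean


def call_llm(model: str, prompt: str) -> str:
    """Stub for an LLM API call; replace with a real client. Returns a
    canned score so the sketch runs end to end."""
    return "3"


# Each scenario is linked to its own tailored set of evaluation dimensions.
SCENARIO_DIMENSIONS = {
    "code_generation": ["correctness", "readability", "efficiency"],
    "open_qa": ["factuality", "completeness", "helpfulness"],
    "creative_writing": ["coherence", "originality", "tone"],
}

# Dimension experts: specialized models assumed to be selected offline by
# profiling candidate LLMs on a multi-dimensional validation set.
DIMENSION_EXPERTS = {
    "correctness": "expert-a", "readability": "expert-b", "efficiency": "expert-a",
    "factuality": "expert-c", "completeness": "expert-b", "helpfulness": "expert-c",
    "coherence": "expert-b", "originality": "expert-d", "tone": "expert-d",
}


def classify_scenario(instance: str) -> str:
    """Instance-level scenario classification via a router LLM."""
    prompt = f"Classify this task into one of {list(SCENARIO_DIMENSIONS)}:\n{instance}"
    label = call_llm("router-model", prompt).strip()
    return label if label in SCENARIO_DIMENSIONS else "open_qa"


def score_dimension(instance: str, response: str, dim: str) -> float:
    """Ask the profiled expert for this dimension for a 1-5 score."""
    expert = DIMENSION_EXPERTS[dim]
    prompt = (f"Rate the response on {dim} from 1 to 5.\n"
              f"Task: {instance}\nResponse: {response}")
    return float(call_llm(expert, prompt))


def panel_deliberation(instance: str, response: str, scores: dict) -> float:
    """Panel of Judges: reconcile per-dimension scores into a final verdict.
    A single reconciliation round is sketched here; the paper describes a
    structured multi-agent debate."""
    prompt = (f"Dimension scores: {scores}\nTask: {instance}\n"
              f"Response: {response}\n"
              "Resolve any inconsistencies and give a final 1-5 score.")
    votes = [float(call_llm(j, prompt)) for j in ("judge-1", "judge-2", "judge-3")]
    return mean(votes)


def evaluate(instance: str, response: str) -> float:
    scenario = classify_scenario(instance)
    scores = {d: score_dimension(instance, response, d)
              for d in SCENARIO_DIMENSIONS[scenario]}
    return panel_deliberation(instance, response, scores)


if __name__ == "__main__":
    print(evaluate("Write a haiku about autumn.", "Leaves drift on cold wind"))
```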
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 25569