Keywords: LLM Evaluation, Fine-grained Assessment, Structured Evaluation, Semantic Unit Decomposition
Abstract: The rapid progress of large language models (LLMs) has made automatic evaluation of natural language generation an important yet challenging task. While recent LLM-as-a-judge approaches achieve promising performance, they typically rely on monolithic evaluation and produce only scalar scores. Consequently, most existing LLM-based evaluators suffer from structural limitations arising from stochastic granularity, evaluation-signal compression, and implicit aggregation. In this work, we propose SEAL, a structured framework that rethinks the role of LLMs in automatic evaluation. Rather than treating LLMs as black-box scorers, we conceptualize evaluation as a structured process and cast LLMs as constrained semantic decision modules. Concretely, SEAL addresses these limitations by decomposing evaluation into task-specific semantic units, formulating quality assessment as verifiable binary decisions over sub-dimensions, and enforcing deterministic aggregation, ensuring that the evaluation process is structurally rigorous and interpretable. We evaluate SEAL across multiple tasks and benchmarks. Experimental results demonstrate that SEAL achieves state-of-the-art correlation with human judgments while providing fine-grained, actionable insights. Our findings suggest that structured evaluation offers a principled path toward trustworthy and reproducible LLM-based evaluation. Our code is available at https://anonymous.4open.science/r/SEAL-B837
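The three-stage pipeline described in the abstract (decompose into semantic units, make binary sub-dimension decisions, aggregate deterministically) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the names `decompose`, `seal_score`, and `UnitJudgment` are hypothetical, and sentence splitting stands in for the task-specific semantic-unit decomposition that SEAL would perform with an LLM.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class UnitJudgment:
    unit: str        # one task-specific semantic unit (e.g., a single claim)
    dimension: str   # quality sub-dimension (e.g., "faithfulness")
    passed: bool     # the judge's verifiable binary decision

def decompose(output_text: str) -> list[str]:
    # Placeholder decomposition: one semantic unit per sentence.
    return [s.strip() for s in output_text.split(".") if s.strip()]

def seal_score(
    output_text: str,
    dimensions: list[str],
    judge: Callable[[str, str], bool],  # constrained LLM call returning yes/no
) -> tuple[float, list[UnitJudgment]]:
    # Stage 1: decompose; Stage 2: binary decision per (unit, dimension) pair;
    # Stage 3: deterministic aggregation (here, the overall pass rate).
    judgments = [
        UnitJudgment(u, d, judge(u, d))
        for u in decompose(output_text)
        for d in dimensions
    ]
    score = sum(j.passed for j in judgments) / len(judgments)
    return score, judgments

# Toy stand-in judge so the sketch runs end to end; a real system would
# prompt the judge LLM and parse a strict yes/no answer.
toy_judge = lambda unit, dim: len(unit.split()) > 2
score, trace = seal_score("Paris is in France. Blue.", ["faithfulness"], toy_judge)
print(score, trace)
```

Because aggregation is a fixed function of the recorded binary decisions rather than an opaque model output, the per-unit trace doubles as the fine-grained, reproducible evaluation signal the abstract claims.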
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, LLM/AI agents
Languages Studied: English
Submission Number: 4509