Keywords: LLM Evaluation, Fine-grained Assessment, Structured Evaluation, Semantic Unit Decomposition
Abstract: The rapid progress of large language models (LLMs) has made automatic evaluation of natural language generation an important yet challenging task. While recent LLM-as-a-judge approaches achieve promising performance, they typically rely on monolithic evaluation and produce only scalar scores. Consequently, most existing LLM-based evaluators suffer from structural limitations arising from stochastic granularity, evaluation-signal compression, and implicit aggregation. In this work, we propose SEAL, a structured framework that rethinks the role of LLMs in automatic evaluation. Rather than treating LLMs as black-box scorers, we conceptualize evaluation as a structured process and cast LLMs as constrained semantic decision modules. Concretely, SEAL addresses these limitations by decomposing evaluation into task-specific semantic units, formulating quality assessment as verifiable binary decisions over sub-dimensions, and enforcing deterministic aggregation, ensuring that the evaluation process is structurally rigorous and interpretable. We evaluate SEAL across multiple tasks and benchmarks. Experimental results demonstrate that SEAL achieves state-of-the-art correlation with human judgments while providing fine-grained, actionable insights. Our findings suggest that structured evaluation offers a principled path toward trustworthy and reproducible LLM-based evaluation. Our code is available at https://anonymous.4open.science/r/SEAL-B837
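The three-stage pipeline described in the abstract (decompose into semantic units, make binary sub-dimension decisions, aggregate deterministically) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the names `decompose`, `seal_score`, and `UnitJudgment` are hypothetical, and sentence splitting stands in for the task-specific semantic-unit decomposition that SEAL would perform with an LLM.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class UnitJudgment:
    unit: str        # one task-specific semantic unit (e.g., a single claim)
    dimension: str   # quality sub-dimension (e.g., "faithfulness")
    passed: bool     # the judge's verifiable binary decision

def decompose(output_text: str) -> list[str]:
    # Placeholder decomposition: one semantic unit per sentence.
    return [s.strip() for s in output_text.split(".") if s.strip()]

def seal_score(
    output_text: str,
    dimensions: list[str],
    judge: Callable[[str, str], bool],  # constrained LLM call returning yes/no
) -> tuple[float, list[UnitJudgment]]:
    # Stage 1: decompose; Stage 2: binary decision per (unit, dimension) pair;
    # Stage 3: deterministic aggregation (here, the overall pass rate).
    judgments = [
        UnitJudgment(u, d, judge(u, d))
        for u in decompose(output_text)
        for d in dimensions
    ]
    score = sum(j.passed for j in judgments) / len(judgments)
    return score, judgments

# Toy stand-in judge so the sketch runs end to end; a real system would
# prompt the judge LLM and parse a strict yes/no answer.
toy_judge = lambda unit, dim: len(unit.split()) > 2
score, trace = seal_score("Paris is in France. Blue.", ["faithfulness"], toy_judge)
print(score, trace)
```

Because aggregation is a fixed function of the recorded binary decisions rather than an opaque model output, the per-unit trace doubles as the fine-grained, reproducible evaluation signal the abstract claims.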
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, LLM/AI agents
Languages Studied: English
Submission Number: 4509