Keywords: Automated Evaluation, Large Language Model, Explainable Reasoning, Aggregation Methods
Abstract: Evaluating complex texts across domains requires converting user-defined criteria into quantitative, explainable indicators, a persistent challenge in search and recommendation systems. Single-prompt LLM evaluations suffer from prompt complexity and inference latency, while criterion-specific decomposition approaches rely on naive averaging or opaque black-box aggregation. We present an interpretable aggregation framework that combines LLM scoring with the Analytic Hierarchy Process (AHP). Our method generates criterion-specific scores via LLM-as-judge, measures each criterion's discriminative power using the Hellinger distance, and derives statistically grounded weights through AHP pairwise comparison matrices. Experiments on Amazon review helpfulness prediction, summarization quality assessment, and depression-related text scoring demonstrate that our approach achieves high explainability and operational efficiency while maintaining predictive power comparable to black-box alternatives, making it suitable for latency-sensitive web services.
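The abstract describes a three-step pipeline (per-criterion LLM scoring, Hellinger-distance discriminability, AHP weighting). Below is a minimal sketch of the last two steps. The score distributions, criterion names, and the construction of the pairwise comparison matrix as ratios of Hellinger distances are all assumptions for illustration; the paper's exact matrix construction is not specified in the abstract.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def ahp_weights(matrix):
    """AHP weights: the normalized principal eigenvector of the comparison matrix."""
    vals, vecs = np.linalg.eig(matrix)
    principal = np.real(vecs[:, np.argmax(np.real(vals))])
    return principal / principal.sum()

# Hypothetical per-criterion LLM-as-judge score distributions over a 1-5 scale,
# conditioned on the ground-truth label (e.g., helpful vs. not helpful).
pos = {"clarity": np.array([.05, .10, .20, .35, .30]),
       "detail":  np.array([.10, .15, .25, .30, .20])}
neg = {"clarity": np.array([.30, .30, .20, .15, .05]),
       "detail":  np.array([.25, .30, .25, .15, .05])}

criteria = list(pos)
# Discriminative power of each criterion: distance between its
# class-conditional score distributions.
d = np.array([hellinger(pos[c], neg[c]) for c in criteria])

# Assumed construction: pairwise comparison ratios A[i, j] = d_i / d_j,
# a consistent AHP matrix whose principal eigenvector recovers the weights.
A = d[:, None] / d[None, :]
w = ahp_weights(A)
print(dict(zip(criteria, np.round(w, 3))))
```

With this construction the final document score would be the weighted sum of the per-criterion LLM scores under `w`, keeping every weight traceable back to a measurable distance between score distributions.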
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: educational applications, essay scoring
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 8103