Transforming Expert Insight into Scalable AI Assessment: A Framework for LLM-Generated Metrics and User-Calibrated Evaluation

Published: 15 Jun 2025, Last Modified: 07 Aug 2025, AIA 2025, CC BY 4.0
Keywords: qualitative-to-quantitative metrics, LLM evaluation, expert calibration, adaptive assessment, learning design, human-AI alignment
TL;DR: This paper introduces an LLM-driven framework for converting qualitative expert feedback into quantitative metrics, enabling scalable and reliable AI assessment through rigorous expert calibration.
Abstract: Effectively assessing AI systems, particularly those operating in specialized domains or requiring dynamic outputs, necessitates translating nuanced human expertise into scalable, quantitative measures. Traditional metrics often fall short of capturing qualitative requirements that domain experts intuitively grasp. This paper presents a novel framework that systematically transforms qualitative expert feedback into quantitative assessment metrics. Our methodology leverages Large Language Models (LLMs), first to articulate and formalize these metrics from expert input, and subsequently as "judges" that apply them automatically. As validation, we present initial expert calibration results, ensuring automated assessments align with human judgment and can evolve with changing requirements. Learning content creation serves as our illustrative specialized domain; its reliance on learning design frameworks, coupled with the need for nuanced expert evaluation of pedagogical quality, makes it an ideal test case for our framework. Results confirm that our LLM-generated, expert-calibrated metrics achieve promising alignment with expert evaluations, enabling robust, scalable, and adaptable assessment.
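
The two-stage pattern described above can be illustrated with a minimal sketch: a metric formalized from expert feedback, an LLM "judge" that scores content against it, and a calibration step that checks rank agreement with expert scores. All names here (call_llm, the rubric text, the 1-5 scale) are illustrative assumptions, not the paper's actual prompts or implementation.

```python
"""Illustrative sketch of an LLM-as-judge metric with expert calibration.
Assumes a hypothetical call_llm() backend; not the paper's implementation."""

from dataclasses import dataclass
from scipy.stats import spearmanr


@dataclass
class Metric:
    name: str
    rubric: str                 # formalized from qualitative expert feedback (stage 1)
    scale: tuple = (1, 5)       # assumed scoring range


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend (hypothetical helper)."""
    raise NotImplementedError


def judge(metric: Metric, content: str) -> int:
    """Stage 2: ask the LLM judge to score content against the rubric."""
    prompt = (
        f"Rubric ({metric.name}, score {metric.scale[0]}-{metric.scale[1]}):\n"
        f"{metric.rubric}\n\nContent:\n{content}\n\n"
        "Return only the integer score."
    )
    return int(call_llm(prompt).strip())


def calibrate(metric: Metric, items: list[str], expert_scores: list[int]) -> float:
    """Expert calibration: rank correlation between LLM and expert scores."""
    llm_scores = [judge(metric, item) for item in items]
    rho, _ = spearmanr(llm_scores, expert_scores)
    return rho
```

In this reading, low rank correlation would signal that the rubric (or the judging prompt) needs another round of expert refinement, which is the adaptability the abstract emphasizes.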
Paper Type: New Full Paper
Submission Number: 18