Automated Construction of High-quality Evaluation Datasets Based on LLMs

Published: 2025 · Last Modified: 04 Nov 2025 · ICIC (23) 2025 · CC BY-SA 4.0
Abstract: Evaluation benchmarks play a vital role in advancing artificial general intelligence. However, current benchmarks face major challenges, including data contamination, insufficient coverage of emerging tasks, and limited discriminative power. This paper introduces a novel automated method for constructing evaluation datasets using large language models (LLMs). Our key innovations include: (1) integrating educational assessment frameworks (Bloom's and SOLO taxonomies) into LLM evaluation, (2) implementing a role-based generation strategy with six specialized expert roles, and (3) developing a multi-round optimization mechanism with structured quality control. Through comprehensive experiments on mathematical reasoning, code generation, and reading comprehension tasks, our method demonstrates superior performance over existing approaches, achieving significant improvements in discrimination, reliability, and validity. The proposed framework not only reduces manual effort in dataset construction but also provides a systematic solution for generating high-quality, comprehensive evaluation benchmarks, establishing new standards for LLM assessment.
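To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how a role-based generation step followed by multi-round optimization could be wired together. It assumes a generic chat-style LLM interface; the role labels, the `call_llm` helper, the prompts, and the number of refinement rounds are illustrative placeholders, not the authors' implementation.

```python
# Hypothetical sketch of role-based generation plus multi-round optimization.
# Role names, prompts, and the call_llm helper are assumed, not from the paper.
from dataclasses import dataclass

EXPERT_ROLES = [
    "question designer", "domain expert", "educational assessor",
    "difficulty calibrator", "answer verifier", "style reviewer",
]  # six specialized roles, per the abstract; the labels here are assumed

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g. a chat completion request)."""
    return f"[LLM output for prompt: {prompt[:40]}...]"

@dataclass
class Item:
    question: str
    answer: str
    taxonomy_level: str  # e.g. a Bloom's level such as "Analyze"

def generate_item(task: str, level: str) -> Item:
    # Generation: one role drafts the question, another verifies the answer.
    draft = call_llm(f"As a {EXPERT_ROLES[0]}, write a {task} question "
                     f"targeting Bloom's level '{level}'.")
    answer = call_llm(f"As a {EXPERT_ROLES[4]}, solve and verify: {draft}")
    return Item(question=draft, answer=answer, taxonomy_level=level)

def refine(item: Item, rounds: int = 3) -> Item:
    # Multi-round optimization: each pass asks a reviewer role for structured
    # feedback, then rewrites the question to address it.
    for _ in range(rounds):
        feedback = call_llm(f"As a {EXPERT_ROLES[5]}, critique this item for "
                            f"clarity and discriminative power: {item.question}")
        item.question = call_llm(f"Rewrite the question to address: {feedback}\n"
                                 f"Original: {item.question}")
    return item

if __name__ == "__main__":
    item = refine(generate_item("mathematical reasoning", "Analyze"))
    print(item.question)
```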