Abstract: We present MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) that requires neither ground-truth annotations nor coordination among evaluators. The framework prompts multiple LLMs independently and scores outputs against a 12-point human-aligned schema, yielding nuanced, high-quality assessments. It aligns closely with human judgment and consistently outperforms prior approaches, while incurring substantially lower computational overhead, making it a scalable and efficient solution for evaluating LLMs on open-ended, real-world tasks.
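To make the mechanism described in the abstract concrete, here is a minimal sketch of reference-free evaluation with independently prompted judges and a fixed rubric. It assumes a per-dimension numeric judge scale and mean aggregation; the dimension names, the judge callable interface, and `stub_judge` are all hypothetical illustrations, not the paper's actual schema or prompts.

```python
from statistics import mean
from typing import Callable

# Hypothetical 12-dimension rubric; the paper's actual schema may differ.
RUBRIC_DIMENSIONS = [
    "relevance", "coherence", "factuality", "completeness",
    "fluency", "conciseness", "helpfulness", "safety",
    "consistency", "creativity", "tone", "instruction_following",
]

def evaluate_reference_free(
    response: str,
    judges: list[Callable[[str, str], float]],
) -> dict[str, float]:
    """Score a response on each rubric dimension with several
    independently prompted judges, then average per dimension.
    No ground-truth reference and no coordination between judges."""
    scores: dict[str, float] = {}
    for dim in RUBRIC_DIMENSIONS:
        # Each judge is queried in isolation (no shared context),
        # mirroring the "independently prompted" design.
        per_judge = [judge(response, dim) for judge in judges]
        scores[dim] = mean(per_judge)
    return scores

# Stub standing in for a real LLM judge call.
def stub_judge(response: str, dimension: str) -> float:
    return 3.0  # placeholder score on an assumed 1-5 scale

if __name__ == "__main__":
    result = evaluate_reference_free(
        "Example model output.", [stub_judge, stub_judge, stub_judge]
    )
    print(result)
```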
Paper Type: Short
Research Area: Generation
Research Area Keywords: Automatic Evaluation, document-level extraction, zero/few-shot extraction, LLM/AI agents
Contribution Types: Position papers
Languages Studied: English
Submission Number: 2973