Keywords: principles, benchmark, evaluation, large language models
Abstract: Large language models (LLMs) are often evaluated on benchmarks built from surface-level instructions, which obscure what defines a high-quality response. We argue that tasks can be characterized more precisely through principles: human-readable rules that specify what matters for a good response to the task. We propose a framework that automatically extracts and generates task-level principles for data generation and evaluation. Using this approach, we build a benchmark of over 20K principle-aligned instances, enabling controllable data creation and fine-grained, interpretable assessment of LLMs. Experiments show that principles both improve output quality and scale evaluation beyond manual curation, offering a new recipe for the principled assessment of LLM capabilities.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Archival
Submission Number: 34