The Rules of the Game: A Survey of Rubrics for Large Language Models

Wenhan Liu, Jiajie Jin, Zhaoheng Huang, Tongyu Wen, Guanting Dong, Ziliang Zhao, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen

Published: 25 May 2026, Last Modified: 25 May 2026. OpenReview Archive Direct Upload. CC BY 4.0

Abstract: Large language models (LLMs) have rapidly evolved from general text generators into increasingly capable systems for reasoning, decision-making, tool use, and long-horizon problem solving. As their application scenarios expand toward open-ended and high-stakes tasks, including deep research, medical diagnosis, multimodal generation, and agentic tool use, the question of how to specify, optimize, and evaluate model responses has become increasingly important. Simple correctness signals, holistic preference scores, and unconstrained LLM-based judgments are often insufficient for these settings, where response quality depends on multiple criteria such as factuality, completeness, safety, reasoning soundness, evidence grounding, and practical utility. Rubrics have therefore emerged as a promising mechanism for making evaluation standards explicit and operational. By decomposing broad quality expectations into structured and interpretable criteria, rubrics provide an interface for both training supervision and model evaluation. This survey presents a comprehensive and systematic overview of rubric-based research for LLMs. We first clarify the concept of rubrics and distinguish it from closely related concepts, including reward models, verifiable rewards, and LLM-as-a-judge. We then organize existing studies along three major directions. First, we summarize existing rubric construction methods and organize them into four categories: direct generation, contrastive generation, iterative refinement, and online or co-evolving generation. Second, we examine how rubrics support the training of policy models and reward models. For policy model training, we organize existing studies by their training mechanisms. For reward model training, we categorize prior work according to the functional roles that rubrics play in reward modeling. 
Third, we summarize rubric-driven task evaluation for both general and domain-specific tasks, and discuss evaluation benchmarks from several perspectives. Beyond consolidating existing work, we discuss several key open questions, such as rubric reward hacking, bias in rubric-based evaluation, personalization, and rubric safety. We hope this survey can serve as a structured reference for current research and a conceptual foundation for developing rubrics as transparent, adaptive, and trustworthy interfaces for future LLM systems. Given the rapid development of rubric-based research, we will keep this survey updated to incorporate new advances and emerging directions in this area.