ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

Zhilin Wang; Jaehun Jung; Ximing Lu; Shizhe Diao; Ellie Evans; Jiaqi Zeng; Pavlo Molchanov; Yejin Choi; Jan Kautz; Yi Dong

Published: 26 Jan 2026, Last Modified: 11 Apr 2026. Venue: ICLR 2026 Poster. License: CC BY 4.0

Keywords: expert-annotated, professional knowledge, llm judge, rubric evaluation

TL;DR: We curate a human-annotated rubric dataset at PhD/MBA level across Physics, Chemistry, Finance and Consulting, with more than 7000 criterion-response pairs, and introduce methods that mitigate bias and reduce the high cost of evaluation, making the benchmark fair and accessible to all.

Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, making the benchmark fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks.

Primary Area: datasets and benchmarks

Submission Number: 21097
