Articulate Intuition or Genuine Analysis? Benchmarking Epistemic Reliability in LLM-as-a-Judge Peer Review
Keywords: Epistemology, Peer Review, Reliability, Philosophy of Science, Large Language Models
TL;DR: We show that “System 2” signals in peer-review judgments often reflect length and analytical style, not reasoning quality; reliable LLM-as-a-judge benchmarks must test epistemic function, not just form.
Abstract: When an LLM judge calls a peer review analytical and a human committee calls another review high quality, are they tracking the same thing? We argue they are not, and that the difference matters philosophically. We operationalise Kahneman's dual-process theory into a structured rubric for peer review and release Kahneman4Review, a benchmark of 3,563 rated reviews scored along nine theoretically motivated textual dimensions, eight bias diagnostics, and a continuous reasoning-quality score. Three findings bear on trustworthiness: decision tier is not detectably aligned with the rubric's text-grounded epistemic-quality proxy; public-showcase agentic reviews receive higher raw scores than pooled human reviews, but length and venue explain most of the gap and the samples are not paper-paired; and ICLR review-text diagnostics shift at the 2022–2023 transition, which coincides in time with widespread LLM availability, although our data do not identify the cause of the shift. A matched function-probe pilot further shows that the rubric distinguishes textual probes designed to contrast genuine fault-finding with surface fluency. We argue that a trustworthy reliability benchmark for LLM judges must separate analytical form from epistemic function, and we propose concrete design choices toward that goal. An interactive demo is available at https://huggingface.co/spaces/nuojohnchen/Kahneman4Review.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 5