Keywords: reasoning
TL;DR: Improved AI judge ratings on corpus synthesis tasks do not reliably correspond to gains that domain experts perceive.
Abstract: Do LLM-as-judge improvements on synthesis tasks correspond to gains that domain experts actually perceive? We investigate by scaffolding LLM corpus analysis with operations drawn from human cognitive science—assumption excavation, absence detection, pattern induction, dialectical challenge—each formalized as a structured prompt. In experiments across three domains (machine learning, hand surgery, defense policy; n=758 documents), the scaffolded pipeline consistently outperforms a single-shot baseline under Claude Opus 4.6 judges (+15.6% to +32.0%), confirmed by cross-model replication with Codex (GPT-5.2) judges (+9.4% to +25.3%) and GLM-5 open-weight judges (+7.6% to +26.9%). However, four blinded domain experts split: two preferred the baseline for concreteness and factual discipline, one preferred the pipeline for its inferential depth, and one found no meaningful difference. The pipeline’s strongest AI-judge dimension—assumption surfacing—was perceived as stating field-obvious truths by reviewers who preferred the baseline. This exposes a structural failure mode: AI judges reward structural explicitness—making implicit corpus features visible—while practitioners value epistemic novelty—information they did not already know. A rubric-free pairwise preference experiment (96.2% pipeline preference across two model families without scoring criteria) confirms the bias is intrinsic, not a rubric artifact. The underlying principle is portable: what is implicit in a specialized corpus is largely what experts already know, so mining implicit structure rediscovers field priors, and any AI judge will score this rediscovery as insight.
Paper Type: New Full Paper
Submission Number: 92