Keywords: LLM-as-Judge, Geometric Mode Collapse, Subspace Alignment, Multilingual Evaluation, Community Health
Abstract: High inter-model agreement among LLM judges is often cited as evidence of reliability, but we challenge this: consensus frequently reflects geometric mode collapse, where LLMs flatten quality judgments onto low-rank, fluency-dominated subspaces, erasing culturally grounded dimensions. We formalize evaluation as a lossy projection operator and validate this on a multilingual health benchmark (Hindi, Kannada, Malayalam; 15 medical professionals; 600+ judgments). Findings: (1) variance compression ($\sigma_{\text{LLM}}/\sigma_{\text{Human}} = 0.69$; 93--95\% null-space); (2) orthogonal subspaces ($>79°$ principal angles; $<7\%$ shared variance); (3) resource-stratified collapse (85.9\% for Malayalam). We introduce subspace alignment theory: LLMs achieve high agreement through redundant projection onto shared fluency manifolds orthogonal to human judgment, so consensus signals redundancy, not validity. We provide geometric diagnostics (rank, null-space, cross-lingual transfer) and outline subspace augmentation remedies, reframing LLM-as-Judge reliability as a geometry problem with actionable solutions.
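The geometric diagnostics named in the abstract (variance compression and principal angles between judgment subspaces) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the score matrices here are random stand-ins for the LLM-judge and human-expert ratings, and the subspace rank `k` is an assumed parameter.

```python
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)

# Hypothetical score matrices: rows = evaluated items, columns = rating
# dimensions. In the paper these would be LLM-judge vs. human-expert
# scores; random data here only exercises the diagnostics.
llm_scores = rng.normal(size=(200, 5))
human_scores = rng.normal(size=(200, 5))

# Diagnostic 1: variance compression, sigma_LLM / sigma_Human.
compression = llm_scores.std() / human_scores.std()

def top_k_subspace(scores, k=2):
    """Columns spanning the top-k principal subspace of centered scores."""
    centered = scores - scores.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

# Diagnostic 2: principal angles between the two judgment subspaces
# (near 90 degrees would indicate orthogonal subspaces).
angles_deg = np.degrees(
    subspace_angles(top_k_subspace(llm_scores), top_k_subspace(human_scores))
)

print(f"variance compression ratio: {compression:.2f}")
print(f"largest principal angle: {angles_deg.max():.1f} deg")
```

With real rating matrices in place of the random ones, a compression ratio well below 1 and principal angles approaching 90° would correspond to the collapse pattern the abstract reports.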
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Resources and Evaluation, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Hindi, Kannada, Malayalam
Submission Number: 10549