Interpretable Representation Evaluation — A Spectral Principle for Probe Reliability

20 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0
Keywords: probe reliability, interpretability, representation evaluation, spectral analysis, eigengap, Fisher information
TL;DR: Probe reliability follows spectral separation: clear gaps ensure trust, gap closure signals failure.
Abstract: Linear probes are widely used to interpret and evaluate learned representations, yet their reliability is often questioned: probes can appear accurate in some regimes but collapse unpredictably in others. We identify the spectral mechanism behind this phenomenon and develop a spectral identifiability principle that serves as a practical diagnostic. Specifically, when the Fisher information spectrum maintains a nontrivial eigengap separating the discriminative subspace from the remaining directions, the estimated subspace concentrates around the true one and probe accuracy remains stable; when the gap vanishes, accuracy collapses abruptly, in the manner of a phase transition. Our analysis connects eigengap geometry, sample size, and probe reliability through finite-sample reasoning, framed as an interpretable criterion rather than a generic error bound. Controlled synthetic studies confirm the predicted transitions, and the framework highlights how inspecting the eigenspectrum can flag unreliable probe evaluations before they mislead downstream model assessment.
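The abstract states the eigengap criterion only qualitatively. The sketch below is a minimal illustration under stated assumptions, not the authors' procedure. It assumes a hypothetical spiked Gaussian model, x = a·y·u + z with labels y in {-1, +1}, a planted unit direction u and isotropic noise z ~ N(0, I). As a stand-in for the Fisher information it uses the logistic-probe Fisher matrix evaluated at w = 0, where every predicted probability is 1/2, so E[p(1-p) x xᵀ] reduces to one quarter of the empirical second-moment matrix. The function names (make_spiked_data, fisher_spectrum_at_zero, relative_eigengap, probe_on_top_direction) are hypothetical.

```python
import numpy as np


def make_spiked_data(n, d, amplitude, rng):
    """Hypothetical toy model: x = amplitude * y * u + z, z ~ N(0, I), y in {-1, +1}."""
    u = np.zeros(d)
    u[0] = 1.0  # planted discriminative direction (an assumption for illustration)
    y = rng.choice([-1.0, 1.0], size=n)
    X = amplitude * y[:, None] * u[None, :] + rng.standard_normal((n, d))
    return X, y, u


def fisher_spectrum_at_zero(X):
    """Eigendecomposition of the logistic-probe Fisher information at w = 0.

    At w = 0 every predicted probability is 1/2, so F = E[p(1-p) x x^T]
    reduces to one quarter of the empirical second-moment matrix.
    """
    F = 0.25 * (X.T @ X) / X.shape[0]
    evals, evecs = np.linalg.eigh(F)      # eigh returns ascending order
    return evals[::-1], evecs[:, ::-1]    # reorder to descending


def relative_eigengap(evals, k=1):
    """Gap between the k-th and (k+1)-th eigenvalues, relative to the (k+1)-th."""
    return (evals[k - 1] - evals[k]) / evals[k]


def probe_on_top_direction(X_tr, y_tr, X_te, y_te):
    """Fit a 1-D probe along the top Fisher eigenvector; return spectrum, direction, test accuracy."""
    evals, evecs = fisher_spectrum_at_zero(X_tr)
    v = evecs[:, 0]
    # Eigenvectors are defined only up to sign; fix the sign with the training labels.
    sign = np.sign(np.mean(y_tr * (X_tr @ v))) or 1.0
    acc = np.mean(np.sign(X_te @ v) * sign == y_te)
    return evals, v, acc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 400, 200
    for amplitude in (3.0, 0.5):
        X_tr, y_tr, u = make_spiked_data(n, d, amplitude, rng)
        X_te, y_te, _ = make_spiked_data(2000, d, amplitude, rng)
        evals, v, acc = probe_on_top_direction(X_tr, y_tr, X_te, y_te)
        print(f"a={amplitude}: relative gap={relative_eigengap(evals):.2f}, "
              f"|<v,u>|={abs(v @ u):.2f}, test accuracy={acc:.2f}")
```

In this toy setting the transition is a known result for spiked covariance models (the BBP threshold): with unit-variance isotropic noise, the top sample eigenvalue detaches from the noise bulk, and the top eigenvector aligns with u, only when a² exceeds √(d/n). With d/n = 0.5 the threshold is a² ≈ 0.71, so a = 3 lies well above it and a = 0.5 below it. One would therefore expect a large relative gap, strong alignment and high probe accuracy in the first case, and a near-zero gap with near-chance accuracy in the second. The paper's own synthetic studies and its exact Fisher estimator may differ from this assumed setup.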
Primary Area: interpretability and explainable AI
Submission Number: 24726