Keywords: retrieval-augmented generation, dense retrieval, embedding evaluation, topological data analysis, Mapper algorithm, Von Neumann entropy, graph Laplacian, spectral methods, high-dimensional geometry, unsupervised evaluation
TL;DR: The area under the Von Neumann entropy curve of a Mapper graph built from an unlabelled embedding sample predicts RAG retrieval quality across two unrelated domains, giving a label-free diagnostic for embedding-space configuration.
Abstract: We show that a single spectral quantity, computed on a topological Mapper graph built from a small unlabelled sample of an embedded corpus, predicts retrieval quality without any labelled queries. We apply the Mapper algorithm from topological data analysis to the embedded corpus, constructing a multi-level graph in which nodes represent clusters of nearby documents and edges encode cluster overlap. As connectivity within each Mapper level increases from sparse to dense, we track the Von Neumann entropy of the normalised graph Laplacian of the largest connected component, producing a multi-scale entropy curve. The area under this curve (AUC) is negatively correlated with retrieval quality in two structurally dissimilar domains: financial document retrieval from SEC filings (Spearman $r = -0.68$, $p = 0.001$, $n = 20$ conditions) and tabular data retrieval over Wikipedia tables ($r = -0.47$, $p < 0.01$, $n = 20$ conditions). Both correlations are statistically significant. A low AUC indicates that the spectrum of the Mapper graph stays concentrated and non-uniform across all connectivity scales, which suggests a corpus with well-separated semantic clusters.
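The entropy-curve computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the Mapper graph is already available as a symmetric matrix of cluster-overlap weights, that connectivity is varied by thresholding those weights (the abstract does not say how the sweep is performed), and that the Von Neumann entropy is taken over the density matrix $\rho = L / \mathrm{tr}(L)$, a common convention. The names `largest_component`, `von_neumann_entropy`, and `entropy_auc` are hypothetical.

```python
import numpy as np


def largest_component(adj):
    """Return the adjacency submatrix of the largest connected component."""
    n = adj.shape[0]
    seen = np.zeros(n, dtype=bool)
    best = []
    for start in range(n):
        if seen[start]:
            continue
        # Depth-first search from an unvisited node.
        stack, comp = [start], []
        seen[start] = True
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in np.flatnonzero(adj[u]):
                if not seen[v]:
                    seen[v] = True
                    stack.append(v)
        if len(comp) > len(best):
            best = comp
    idx = np.array(sorted(best))
    return adj[np.ix_(idx, idx)]


def von_neumann_entropy(adj):
    """Entropy of rho = L / tr(L), with L the normalised Laplacian of the
    largest connected component (assumed convention, natural logarithm)."""
    adj = (np.asarray(adj) > 0).astype(float)
    np.fill_diagonal(adj, 0.0)          # ignore self-loops
    adj = largest_component(adj)
    n = adj.shape[0]
    if n < 2:
        return 0.0                      # a single node has no spectrum to speak of
    deg = adj.sum(axis=1)               # all degrees >= 1 inside a component
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eig = np.linalg.eigvalsh(lap / np.trace(lap))
    eig = eig[eig > 1e-12]              # treat 0 * log(0) as 0
    return float(-(eig * np.log(eig)).sum())


def entropy_auc(weights, thresholds):
    """Trapezoidal area under the entropy curve over a threshold sweep.

    Assumption: an edge is kept when its overlap weight is >= the threshold,
    so lower thresholds give denser graphs.
    """
    weights = np.asarray(weights, dtype=float)
    thresholds = np.sort(np.asarray(thresholds, dtype=float))
    curve = np.array([von_neumann_entropy(weights >= t) for t in thresholds])
    return float(np.sum(0.5 * (curve[1:] + curve[:-1]) * np.diff(thresholds)))
```

Under this convention a complete graph on $n$ nodes has entropy $\log(n-1)$, the maximum for a connected graph of that size, while a single edge has entropy 0. A curve that stays low across the sweep therefore corresponds to the concentrated spectral structure that the abstract associates with good retrieval.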
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 121