DPMI: A Principled Index for Neural Polysemanticity via Dirichlet Process Mixture Modeling

Published: 27 May 2026, Last Modified: 27 May 2026 | CompLearn 2026 Poster | CC BY 4.0
Keywords: Polysemanticity, Mechanistic Interpretability, Dirichlet Process Mixture Models, Superposition, Sparse Autoencoders, Jensen-Shannon Divergence
TL;DR: We introduce DPMI, a principled scalar index that quantifies neural polysemanticity by combining component count and separation using Dirichlet Process Mixture Models and Jensen-Shannon divergence.
Abstract: Polysemanticity, where a single neuron responds to multiple unrelated concepts, is a central obstacle in mechanistic interpretability, yet the field lacks a principled continuous scalar for it. We introduce the $\textbf{Dirichlet Process Polysemanticity Index}$ (DPMI), a per-neuron score that combines inferred component count and component separation by fitting a non-parametric DPGMM and weighting by mean pairwise Jensen-Shannon divergence. On a controlled toy benchmark, DPMI achieves Spearman $\rho = 0.755$ and $\text{AUROC} = 0.877$, outperforming seven baselines. Against an independent Fourier-analytic ground truth from a modular-arithmetic transformer, DPMI remains significant with $\rho = 0.255$ ($p < 10^{-8}$). Across six architectures, we find a robust cross-modal law: language models are significantly more polysemantic than vision models ($d = 0.803$, $p < 10^{-129}$). Ablations show the non-parametric prior is essential (removing it drops $\rho$ by $0.040$–$0.045$), and DPMI-guided SAE budget allocation improves reconstruction $R^2$ for the most polysemantic quartile by $+0.010$ at fixed compute.
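As a rough illustration of the separation term described in the abstract, the sketch below computes the mean pairwise Jensen-Shannon divergence between one-dimensional Gaussian components and combines it with the component count. It is not the authors' implementation: the abstract does not give the exact DPMI formula, so the combining rule `(k - 1) * mean_jsd`, the `min_weight` pruning threshold, the grid-based integration, and all function names are assumptions made for this example. The component parameters are taken as given, standing in for the output of a fitted DPGMM.

```python
import math
from itertools import combinations


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def js_divergence(p, q, dx):
    """Jensen-Shannon divergence in bits (range [0, 1]) between two
    densities sampled on the same evenly spaced grid with spacing dx."""
    total = 0.0
    for pi, qi in zip(p, q):
        mi = 0.5 * (pi + qi)
        if pi > 0.0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0.0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total * dx


def polysemanticity_score(components, min_weight=0.05, grid_size=2001):
    """Hypothetical DPMI-style score for one neuron.

    components: list of (weight, mean, std) tuples, e.g. read off a
    fitted Dirichlet process Gaussian mixture over the neuron's activations.
    min_weight: assumed threshold below which a component is treated as
    inactive (the paper's actual pruning rule is not stated in the abstract).
    """
    active = [(m, s) for (w, m, s) in components if w >= min_weight]
    k = len(active)
    if k < 2:
        return 0.0  # a single component counts as monosemantic here

    # Shared grid covering every active component out to six std devs.
    lo = min(m - 6.0 * s for m, s in active)
    hi = max(m + 6.0 * s for m, s in active)
    dx = (hi - lo) / (grid_size - 1)
    grid = [lo + i * dx for i in range(grid_size)]
    densities = [[gaussian_pdf(x, m, s) for x in grid] for m, s in active]

    pairs = list(combinations(range(k), 2))
    mean_jsd = sum(
        js_divergence(densities[i], densities[j], dx) for i, j in pairs
    ) / len(pairs)

    # Assumed combining rule: extra components, weighted by separation.
    return (k - 1) * mean_jsd
```

Under this assumed rule, a neuron with one active component scores zero, two fully overlapping components also score zero, and two well-separated components score close to one. Using both count and separation means that a mixture the model splits into many nearly identical components is not mistaken for a highly polysemantic neuron.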
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 50