Benchmarking Cross-Seed Feature Correspondence in Sparse Autoencoders

TMLR Paper7897 Authors

11 Mar 2026 (modified: 14 Mar 2026). Under review for TMLR. CC BY 4.0
Abstract: Sparse autoencoders (SAEs) trained on the same model learn seed-dependent dictionaries, raising the question of whether features found by one run correspond to those found by another. We introduce a benchmark that evaluates cross-seed matching methods on functional grounds, beyond geometric similarity, using two complementary tests: per-feature ablation fingerprints for scalable screening, and a substitution test that directly measures functional interchangeability by swapping one SAE’s feature contribution for another’s. Both tests are validated against hard negative controls and stratified by feature activity level. Evaluating eight matching methods on BatchTopK and ReLU SAEs (five seeds, Pythia-410M layers 4, 8, and 12, with replication on GPT-2 Small), we find that cross-seed correspondence exhibits a quality/coverage tradeoff analogous to precision/recall. At the top of the ranking, greedy cosine and Sinkhorn optimal transport perform equally well (R = 0.86 at top-100); in the tail, Sinkhorn with uniform marginals retains higher quality (R = 0.60 vs. 0.52 at top-2000), achieving the highest overall AUSQC (area under the substitution-quality curve). Results are validated on a held-out corpus with seed-level bootstrap confidence intervals. All claims are restricted to the fingerprinted feature subset (∼42%), and we show that effect sizes attenuate for low-activity features. The benchmark protocol is designed so that future consistency methods can be evaluated on the same footing, providing a shared standard for measuring progress on feature reproducibility.
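The abstract names greedy cosine matching as one of the evaluated methods but does not spell out the procedure. As a rough illustration only, the sketch below pairs features from two SAEs by repeatedly taking the highest remaining cosine similarity between decoder directions, one-to-one. The function name `greedy_cosine_match`, the use of decoder rows as feature vectors, and the one-to-one constraint are assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np

def greedy_cosine_match(dec_a, dec_b):
    """Greedily pair rows of dec_a with rows of dec_b by cosine similarity.

    dec_a: (n_a, d) array and dec_b: (n_b, d) array of feature directions
    (hypothetical inputs, e.g. decoder weights from two seeds).
    Returns a list of (i, j, similarity) tuples, one-to-one, best pairs first.
    """
    # Normalize rows so that the dot product equals cosine similarity.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sim = a @ b.T

    # Visit all candidate pairs from most to least similar.
    order = np.argsort(-sim, axis=None)
    used_a = np.zeros(sim.shape[0], dtype=bool)
    used_b = np.zeros(sim.shape[1], dtype=bool)
    limit = min(sim.shape)

    pairs = []
    for flat in order:
        i, j = divmod(int(flat), sim.shape[1])
        if used_a[i] or used_b[j]:
            continue  # each feature may be matched at most once
        used_a[i] = used_b[j] = True
        pairs.append((i, j, float(sim[i, j])))
        if len(pairs) == limit:
            break
    return pairs
```

Because the returned pairs are ordered by similarity, truncating the list at a rank k gives a "top-k" matched set of the kind the quality/coverage tradeoff refers to. The functional tests described in the abstract (ablation fingerprints and substitution) would then be applied to such pairs and are not part of this sketch.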
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Qitian_Wu1
Submission Number: 7897