Abstract: Sparse autoencoders (SAEs) trained on the same model learn seed-dependent dictionaries,
raising the question of whether features found by one run correspond to those found by another.
We introduce a benchmark that evaluates cross-seed matching methods on functional
grounds rather than geometric similarity alone, using two complementary tests: per-feature ablation
fingerprints for scalable screening, and a substitution test that directly measures functional
interchangeability by swapping one SAE’s feature contribution for another’s. Both tests are
validated against hard negative controls and stratified by feature activity level.
Evaluating eight matching methods on BatchTopK and ReLU SAEs (five seeds, Pythia-410M
layers 4, 8, and 12, with replication on GPT-2 Small), we find that cross-seed correspondence
exhibits a quality/coverage tradeoff analogous to precision/recall. At the top of the ranking,
greedy cosine and Sinkhorn optimal transport perform equally well (R = 0.86 at top-100);
in the tail, Sinkhorn with uniform marginals retains higher quality (R = 0.60 vs. 0.52 at
top-2000), achieving the highest overall AUSQC (area under the substitution-quality curve).
Results are validated on a held-out corpus with seed-level bootstrap confidence intervals.
All claims are restricted to the fingerprinted feature subset (∼42%), and we show that effect
sizes attenuate for low-activity features. The benchmark protocol is designed so that future
consistency methods can be evaluated on the same footing, providing a shared standard for
measuring progress on feature reproducibility.
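
For readers unfamiliar with the two matchers compared at the top of the ranking, the following is a minimal sketch of greedy cosine matching and of Sinkhorn optimal transport with uniform marginals, applied to two decoder dictionaries. It is an illustration under stated assumptions, not the benchmark's implementation: the function names, the use of decoder rows as feature directions, the one-to-one greedy rule, and the `epsilon` and `n_iters` values are all hypothetical choices made here. The substitution test, the R score, and AUSQC are not reproduced.

```python
import numpy as np


def cosine_similarity_matrix(dec_a, dec_b):
    """Pairwise cosine similarity between rows of two decoder matrices.

    Assumption: each row is one feature direction, shape (n_features, d_model).
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    return a @ b.T


def greedy_cosine_match(sim):
    """One-to-one greedy matching: repeatedly take the most similar unmatched pair.

    Returns (i, j, similarity) triples ordered from best to worst match.
    """
    n_a, n_b = sim.shape
    used_a = np.zeros(n_a, dtype=bool)
    used_b = np.zeros(n_b, dtype=bool)
    pairs = []
    # Visit all entries in order of decreasing similarity.
    for flat in np.argsort(-sim, axis=None):
        i, j = divmod(int(flat), n_b)
        if used_a[i] or used_b[j]:
            continue
        used_a[i] = True
        used_b[j] = True
        pairs.append((i, j, float(sim[i, j])))
        if len(pairs) == min(n_a, n_b):
            break
    return pairs


def sinkhorn_plan(sim, epsilon=0.05, n_iters=200):
    """Entropy-regularized transport plan with uniform marginals.

    The cost is taken to be negative cosine similarity (an assumption);
    epsilon is the entropic regularization strength.
    """
    n_a, n_b = sim.shape
    # Gibbs kernel; subtracting the max keeps the exponent non-positive.
    kernel = np.exp((sim - sim.max()) / epsilon)
    row_marginal = np.full(n_a, 1.0 / n_a)
    col_marginal = np.full(n_b, 1.0 / n_b)
    u = np.ones(n_a)
    v = np.ones(n_b)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)
        v = col_marginal / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


if __name__ == "__main__":
    # Synthetic stand-in for two seeds: B is a shuffled, slightly noisy copy of A.
    rng = np.random.default_rng(0)
    dec_a = rng.normal(size=(16, 64))
    perm = rng.permutation(16)
    dec_b = dec_a[perm] + 0.01 * rng.normal(size=(16, 64))

    sim = cosine_similarity_matrix(dec_a, dec_b)
    print(greedy_cosine_match(sim)[:3])
    print(sinkhorn_plan(sim).argmax(axis=1)[:3])
```

Greedy matching yields a ranked list of hard pairs, whereas the Sinkhorn plan is a soft assignment from which pairs can be read off (for example by row-wise argmax) and ranked by transported mass. How the paper ranks and truncates matches for its top-100 and top-2000 comparisons is not specified in this abstract.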
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Qitian_Wu1
Submission Number: 7897