Keywords: sparse decomposability, mechanistic interpretability, sparse autoencoders, causal interventions, benchmarks, DAS, interchange intervention accuracy, dictionary learning, representation geometry, model evaluation
TL;DR: Causal sparse distillation (CSD) benchmarks when dense causal subspaces can be recovered in fixed SAE bases, separating true sparse recovery from random-K degeneracy, weak sites, and geometry-limited failures.
Abstract: Benchmarks for mechanistic interpretability should test not only whether a causal variable can be localized, but whether the localized subspace can be recovered in a reusable representation basis. We propose sparse decomposability as a benchmarked diagnostic for dense causal subspaces: given a DAS-style teacher and a fixed pretrained SAE dictionary, causal sparse distillation (CSD) measures how much interchange-intervention behavior survives when the intervention is constrained to a small set of SAE latents. The diagnostic is calibrated on a 16-cell synthetic benchmark with ground-truth supports, where CSD-L1 recovers correlated-distractor support (F1 = 1.00) while DBM and DiffMean controls fail. On dense-valid Gemma/Qwen tuples, two compact pre-CSD decoder-geometry statistics predict CSD/dense recovery with leave-one-out R² = 0.89 and bootstrap 95% CI [0.79, 0.95], while model size alone gives R² = -2.00. Public-harness evaluations and matched diagnostic controls then separate positive selector-specific MCQA cases from random-K-degenerate RAVEL rows and sparse-limited 27B sites. Matched random-K controls show that high CSD/dense recovery is not by itself evidence of meaningful feature selection. The result is a benchmark-facing instrument: it maps where dense causal variables are sparse-decomposable in existing SAE bases and where benchmark scores should not be interpreted as SAE-level explanations.
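Illustrative note on the leave-one-out R² figures: a negative value such as the one reported for model size alone is possible because each point is predicted by a model fit without it, so a predictor can do worse than the overall mean. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: it uses a single predictor with ordinary least squares, takes the total sum of squares around the mean of all targets (one common convention; the abstract does not state which is used), and runs on made-up toy numbers. The names `fit_line` and `loo_r2` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # Constant predictor: fall back to an intercept-only model.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def loo_r2(xs, ys):
    """Leave-one-out R^2: each point is predicted by a line fit on the rest."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    my = mean(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data (assumed for illustration only, not taken from the paper).
ys = [0.1, 0.9, 2.1, 2.9, 4.1, 4.9]
informative = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]    # tracks the target closely
uninformative = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # carries no information

r2_good = loo_r2(informative, ys)
r2_bad = loo_r2(uninformative, ys)
```

With an uninformative predictor the held-out prediction is the mean of the remaining targets, and the score reduces to 1 - (n / (n - 1))², which is always below zero. A leave-one-out R² below zero therefore indicates that the predictor generalizes worse than a constant baseline, not that a computation has gone wrong.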
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 125