Track: long paper (up to 8 pages)
Keywords: geometric regularization, interpretable representations, concept-based learning, hypersphere constraints, representation engineering, behavioral steering, minimal supervision, developmental interpretability
TL;DR: We use geometric regularizers to anchor concepts to known directions on the hypersphere, enabling targeted behavioral interventions with minimal supervision.
Abstract: We introduce **Sparse Concept Anchoring**, a method that biases the latent space to position a targeted subset of concepts while allowing others to self-organize, using only minimal supervision (in our setting, labels for $<0.1\%$ of examples per anchored concept). Training combines activation normalization, a separation regularizer, and anchor or subspace regularizers that attract the rare labeled examples to predefined directions or axis-aligned subspaces. The anchored geometry enables two practical interventions: reversible behavioral steering, which projects out a concept's latent component at inference time, and permanent removal via targeted weight ablation of anchored dimensions. Experiments on structured autoencoders show selective attenuation of targeted concepts with negligible impact on orthogonal features, and complete elimination with reconstruction error approaching theoretical bounds. Sparse Concept Anchoring therefore provides a practical pathway to interpretable, steerable behavior in learned representations.
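The reversible steering intervention described in the abstract can be illustrated with a minimal sketch: remove a latent vector's component along an anchored concept direction, then renormalize back onto the hypersphere. The function name and the renormalization step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def project_out(z: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Reversible steering sketch (assumed setup): remove the component of a
    normalized latent z along a unit concept direction, then renormalize so
    the edited latent stays on the hypersphere."""
    d = direction / np.linalg.norm(direction)
    z_edit = z - (z @ d) * d          # project out the anchored concept
    return z_edit / np.linalg.norm(z_edit)
```

Because the original component can be stored and added back, the edit is reversible at inference time, in contrast to permanent weight ablation of the anchored dimensions.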
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 19