SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals

Published: 30 Sept 2025, Last Modified: 10 Nov 2025, Venue: Mech Interp Workshop (NeurIPS 2025) Poster, License: CC BY 4.0
Open Source Links: https://github.com/BrachioLab/SuperActivators.git
Keywords: Understanding high-level properties of models
Other Keywords: Concept Vectors, Transformers, Interpretability, Token Activations, Cross-Modal, Attributions
TL;DR: Only the most highly activated tokens, SuperActivators, carry reliable concept signals, outperforming CLS and prompt-based methods and enabling better localization.
Abstract: Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a clear, reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompt-based concept detection approaches, achieving up to a 14% higher $F_1$ score, across diverse image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage these SuperActivator tokens to improve feature attributions for concepts.
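The tail-based detection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that a token's activation is its cosine similarity to a concept vector, that the "extreme high tail" is the top `tail_fraction` of in-concept activations, and that a sample is flagged when any token reaches that tail. The function names, the default `tail_fraction` of 0.1, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def token_activations(token_embeddings, concept_vector):
    """Assumed activation: cosine similarity of each token embedding to the concept vector."""
    return [cosine(t, concept_vector) for t in token_embeddings]


def tail_threshold(in_concept_activations, tail_fraction=0.1):
    """Smallest activation that still falls in the top `tail_fraction` of in-concept activations.

    Expects a non-empty list; `tail_fraction` is an illustrative choice, not a value from the paper.
    """
    ranked = sorted(in_concept_activations, reverse=True)
    k = max(1, int(len(ranked) * tail_fraction))
    return ranked[k - 1]


def detect_concept(activations, threshold):
    """Flag a sample if any token activation reaches the tail; also return those token indices."""
    tail_tokens = [i for i, a in enumerate(activations) if a >= threshold]
    return bool(tail_tokens), tail_tokens


# Toy usage: calibrate a threshold on made-up in-concept activations, then test one sample.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = tail_threshold(calibration, tail_fraction=0.2)
sample = token_activations([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]], [1.0, 0.0])
present, where = detect_concept(sample, threshold)
```

The returned token indices double as a crude localization of the concept within the sample, which is the intuition behind using the same tokens for attribution.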
Submission Number: 201