Keywords: Concept Vectors, Transformers, Interpretability, Token Activations, Cross-Modal, Attributions
TL;DR: We show that transformers concentrate concept signals in a small set of highly activated SuperActivator tokens, which we leverage to improve concept detection and localization.
Abstract: Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within this noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a clear, reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches—achieving up to a 14% higher F1 score—across diverse image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage these SuperActivator tokens to improve feature attributions for concepts.
Primary Area: interpretability and explainable AI
Submission Number: 20435
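To make the mechanism described in the abstract concrete, here is a minimal sketch of SuperActivator-style concept detection. It assumes a concept vector and per-token hidden states from some layer are already available; all names (superactivator_score, tail_fraction, threshold) and the exact tail/threshold choices are illustrative assumptions, not the paper's specification.

```python
# Illustrative sketch only: scores a sequence by the activations of its
# most strongly activated ("SuperActivator") tokens along a concept direction.
import numpy as np

def superactivator_score(hidden_states: np.ndarray,
                         concept_vector: np.ndarray,
                         tail_fraction: float = 0.05) -> float:
    """Mean activation over the extreme high tail of token activations.

    hidden_states: (num_tokens, hidden_dim) token representations from one layer.
    concept_vector: (hidden_dim,) direction associated with the concept.
    tail_fraction: fraction of tokens treated as the high tail (assumed value).
    """
    # Per-token activation = projection onto the concept direction.
    activations = hidden_states @ concept_vector
    # Keep only the highest-activating tokens and average them.
    k = max(1, int(np.ceil(tail_fraction * len(activations))))
    tail = np.sort(activations)[-k:]
    return float(tail.mean())

def detect_concept(hidden_states: np.ndarray,
                   concept_vector: np.ndarray,
                   threshold: float,
                   tail_fraction: float = 0.05) -> bool:
    """Flag the concept as present when the tail score exceeds a calibrated threshold."""
    return superactivator_score(hidden_states, concept_vector, tail_fraction) > threshold
```

The key design point the sketch mirrors is that detection relies only on the few highest-activating tokens rather than an average over all token activations, which is where in-concept and out-of-concept distributions overlap.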