Keywords: Concept Vectors, Superdetector Tokens, Interpretability, Transformers, Attribution Methods, Representation Learning
TL;DR: We show that transformers concentrate concept signals in a small fraction of highly activated "Superdetector Tokens," and leverage them for more accurate and faithful concept detection and localization.
Abstract: Concept vectors aim to connect model representations with human-interpretable semantics, but their signals are often noisy and inconsistent, limiting their reliability. In this work, we identify a new property: across labeled concept regions, concept information is concentrated in a small fraction of highly activated tokens, which we call "Superdetector Tokens." We demonstrate that Superdetector Tokens provide more reliable concept signals than traditional concept-vector and prompting methods, and enable more faithful attributions. Our results suggest that this behavior reflects a general mechanism by which transformers encode semantics, holding across image and text modalities, model families, and both supervised and unsupervised extraction methods.
Submission Number: 100