Keywords: Concept Vectors, Superdetector Tokens, Interpretability, Transformers, Attribution Methods, Representation Learning
TL;DR: We show that transformers concentrate concept signals in a small fraction of highly activated "Superdetector Tokens," and leverage them for more accurate and faithful concept detection and localization.
Abstract: Concept vectors aim to connect model representations with human-interpretable semantics, but their signals are often noisy and inconsistent, limiting their reliability. In this work, we identify a new property: across labeled concept regions, concept information is concentrated in a small fraction of highly activated tokens, which we call "Superdetector Tokens." We demonstrate that Superdetector Tokens provide more reliable concept signals than traditional concept-vector and prompting methods, and enable more faithful attributions. Our results suggest that this behavior reflects a general mechanism by which transformers encode semantics, holding across image and text modalities, model families, and both supervised and unsupervised extraction methods.
Submission Number: 100