Keywords: Vision Transformer, Register Token
TL;DR: Register tokens denoise attention maps but do not make them more interpretable.
Abstract: Recent work has shown that the attention maps of the widely popular DINOv2
model exhibit artifacts, which hurt both model interpretability and performance
on dense image tasks. These artifacts emerge due to the model repurposing patch
tokens with redundant local information for the storage of global image information.
To address this problem, additional register tokens have been introduced so that
the model can store such information there instead. We carefully examine the influence
of these register tokens on the relationship between global and local image features,
showing that while register tokens yield cleaner attention maps, these maps do
not accurately reflect the integration of local image information in large models.
Instead, global information is dominated by information extracted from register
tokens, leading to a disconnect between local and global features. Inspired by these
findings, we show that the [CLS] token itself leads to a very similar phenomenon
in models without explicit register tokens. Our work shows that care must be taken
when interpreting attention maps of large ViTs. Further, by clearly attributing
the faulty behavior to register and [CLS] tokens, we show a path towards more
interpretable vision models.
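For readers unfamiliar with the mechanism, the following is a minimal sketch of how register tokens are typically appended to a ViT's token sequence; the `ViTWithRegisters` module and its parameter names are hypothetical illustrations, not the authors' or DINOv2's actual code:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Minimal sketch (assumed setup): the embedding stage of a ViT that
    appends learnable register tokens alongside the [CLS] token."""

    def __init__(self, embed_dim=768, num_registers=4, num_patches=196):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) patch embeddings from the patchify/projection step.
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        reg = self.register_tokens.expand(B, -1, -1)
        # Positional embeddings apply to [CLS] + patches; registers carry no
        # spatial position and are simply concatenated as extra tokens.
        x = torch.cat([cls, patch_tokens], dim=1) + self.pos_embed
        x = torch.cat([x, reg], dim=1)
        # Downstream transformer blocks would process the registers like any
        # other token; they are typically discarded at the output, serving only
        # as dedicated slots for global information.
        return x
```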
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 21668