Keywords: Vision Transformer, Register Token
TL;DR: Register tokens denoise attention maps but do not make them more interpretable.
Abstract: Recent work has shown that the attention maps of the widely popular DINOv2
model exhibit artifacts, which hurt both model interpretability and performance
on dense image tasks. These artifacts emerge due to the model repurposing patch
tokens with redundant local information for the storage of global image information.
To address this problem, additional register tokens have been introduced so that
the model can store such information there instead. We carefully examine the influence
of these register tokens on the relationship between global and local image features,
showing that while register tokens yield cleaner attention maps, these maps do
not accurately reflect the integration of local image information in large models.
Instead, global information is dominated by information extracted from register
tokens, leading to a disconnect between local and global features. Inspired by these
findings, we show that the [CLS] token itself leads to a very similar phenomenon
in models without explicit register tokens. Our work shows that care must be taken
when interpreting attention maps of large ViTs. Further, by clearly attributing
the faulty behavior to register and [CLS] tokens, we show a path towards more
interpretable vision models.
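For readers unfamiliar with the mechanism, the following is a minimal sketch of how register tokens are typically appended to a ViT's token sequence; the `ViTWithRegisters` module and its parameter names are hypothetical illustrations, not the authors' or DINOv2's actual code:

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Minimal sketch (assumed setup): the embedding stage of a ViT that
    appends learnable register tokens alongside the [CLS] token."""

    def __init__(self, embed_dim=768, num_registers=4, num_patches=196):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.register_tokens = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) patch embeddings from the patchify/projection step.
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        reg = self.register_tokens.expand(B, -1, -1)
        # Positional embeddings apply to [CLS] + patches; registers carry no
        # spatial position and are simply concatenated as extra tokens.
        x = torch.cat([cls, patch_tokens], dim=1) + self.pos_embed
        x = torch.cat([x, reg], dim=1)
        # Downstream transformer blocks would process the registers like any
        # other token; they are typically discarded at the output, serving only
        # as dedicated slots for global information.
        return x
```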
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 21668