Abstract: Vision Transformers (ViTs) have achieved state-of-the-art (SOTA) performance on numerous tasks. However, the emergence of high-norm artifact tokens in supervised and self-supervised ViTs hinders the interpretability of these models' attention maps. This study reproduces and validates previous work (5) that addresses this issue through register tokens, learnable placeholders added to the input sequence, which mitigate artifacts and yield smoother feature maps. We evaluate the presence of artifacts in several ViT models, namely DeiT-III and DINOv2 architectures, and investigate the impact of fine-tuning pre-trained ViTs with register tokens and the additional regularization this introduces. Through experiments on pre-trained and fine-tuned models, we confirm that register tokens eliminate artifacts and improve attention map interpretability.
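As a minimal sketch of the idea behind register tokens (not the authors' implementation; all module and parameter names are illustrative assumptions), the learnable registers are simply extra tokens concatenated to the input sequence alongside the [CLS] token and discarded at the output:

```python
# Illustrative sketch: prepending learnable register tokens to a ViT token sequence.
import torch
import torch.nn as nn

class ViTInputWithRegisters(nn.Module):
    def __init__(self, embed_dim: int = 384, num_registers: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable register tokens: placeholders that participate in attention
        # but have no output role; their outputs are dropped after the last block.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim), e.g. from a patch-embedding layer
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        # Sequence fed to the transformer blocks: [CLS] + registers + patches.
        return torch.cat([cls, reg, patch_tokens], dim=1)
```

At the output of the network, only the [CLS] and patch tokens are used for downstream tasks; the register outputs are simply discarded.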
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: /forum?id=w9pgM58H05
Changes Since Last Submission:
- Added semantic segmentation experiments on ADE20K using DINOv2-L and DeiT-III-S to evaluate the impact of register tokens.
- Included downstream results for DeiT-III-Small with 0, 1, 2, and 4 register tokens; clarified that DINOv2 results use a fixed 4-register-token pretrained backbone.
- Added histogram plots (Figure 9) showing the reduction of artifact tokens via fine-tuning, based on the L2 norm distribution of tokens (see the detection sketch after this list).
- Expanded Section 3.5 with details on the computational requirements for the experiments.
- In Section 4.1.6, added an explanation of how local information was measured using linear classifiers trained to predict token spatial positions.
- Made minor improvements to clarity and figures.
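The histogram analysis referenced above relies on flagging high-norm tokens. A minimal sketch, under our own assumptions (the threshold value and the feature-extraction call are illustrative, not the authors' exact procedure):

```python
# Illustrative sketch: flagging high-norm artifact tokens from patch-token features.
import torch

def artifact_mask(patch_tokens: torch.Tensor, threshold: float = 150.0) -> torch.Tensor:
    """patch_tokens: (batch, num_patches, embed_dim) output features.
    Returns a boolean mask marking tokens whose L2 norm exceeds the threshold."""
    norms = patch_tokens.norm(p=2, dim=-1)  # (batch, num_patches)
    return norms > threshold

# Example usage (feature extractor call is hypothetical):
# feats = model.forward_features(images)
# print("artifact token fraction:", artifact_mask(feats).float().mean().item())
```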
Assigned Action Editor: Lu Jiang
Submission Number: 4353