Abstract: Vision Transformers (ViTs) have achieved state-of-the-art (SOTA) performance on numerous tasks. However, the emergence of high-norm artifact tokens in supervised and self-supervised ViTs hinders the interpretability of their attention maps. This study reproduces and validates previous work that addresses this issue with register tokens, learnable placeholder tokens appended to the input sequence, which mitigate artifacts and yield smoother feature maps. We evaluate the presence of artifacts in several ViT models, namely the DeiT-III and DINOv2 architectures, and investigate the impact of fine-tuning pre-trained ViTs with register tokens and additional regularization. Through experiments on pre-trained and fine-tuned models, we confirm that register tokens eliminate artifacts and improve the interpretability of attention maps.
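The register-token mechanism described above can be summarized in a minimal sketch, not taken from the authors' code: learnable tokens are appended to the ViT input sequence and discarded at the output. Names such as `ViTWithRegisters`, `num_registers`, and `embed_dim` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Hypothetical wrapper that appends learnable register tokens to a ViT token sequence."""

    def __init__(self, backbone: nn.Module, embed_dim: int, num_registers: int = 4):
        super().__init__()
        self.backbone = backbone  # transformer encoder operating on token sequences
        # Learnable placeholder tokens, shared across all images.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        nn.init.trunc_normal_(self.registers, std=0.02)
        self.num_registers = num_registers

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + num_patches, embed_dim), i.e. [CLS] + patch embeddings
        regs = self.registers.expand(tokens.shape[0], -1, -1)
        x = torch.cat([tokens, regs], dim=1)   # append registers to the input sequence
        x = self.backbone(x)
        return x[:, : -self.num_registers]     # drop register outputs; keep CLS + patches
```

The registers add only `num_registers * embed_dim` extra parameters, which is why fine-tuning a pre-trained ViT with them is a comparatively cheap intervention.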
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=w9pgM58H05
Changes Since Last Submission: This work is submitted as part of the MLRC Reproducibility Challenge.
In this updated submission, we have clarified in the Introduction that our choice to fine-tune (rather than train from scratch) was due to limited computational resources (a common constraint in academic settings). We believe this makes our findings especially relevant for researchers with restricted access to compute. Our primary goal remains to replicate and validate the original paper’s results in a more resource-efficient setting.
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 4863