Reproducibility Study of “Vision Transformers Need Registers”

TMLR Paper4863 Authors

15 May 2025 (modified: 08 Sept 2025). Decision pending for TMLR. CC BY 4.0
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art (SOTA) performance on numerous tasks. However, the emergence of high-norm artifact tokens in both supervised and self-supervised ViTs hinders the interpretability of their attention maps. This study reproduces and validates previous work that addresses this issue with register tokens, learnable placeholder tokens appended to the input sequence, which mitigate the artifacts and yield smoother feature maps. We evaluate the presence of artifacts in several ViT models, namely the DeiT-III and DINOv2 architectures, and investigate the impact of fine-tuning pre-trained ViTs with register tokens and the additional regularization they introduce. Through experiments on pre-trained and fine-tuned models, we confirm that register tokens eliminate the artifacts and improve the interpretability of attention maps.
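For readers unfamiliar with the mechanism, below is a minimal PyTorch sketch of how register tokens are typically added: learnable embeddings are concatenated to the CLS and patch tokens before the transformer blocks and discarded from the output. The class name, dimensions, and encoder configuration here are illustrative assumptions, not the authors' or the original paper's implementation.

```python
import torch
import torch.nn as nn


class ViTWithRegisters(nn.Module):
    """Toy ViT-style encoder illustrating register tokens (sketch only)."""

    def __init__(self, embed_dim=384, depth=4, num_heads=6, num_registers=4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable placeholder ("register") tokens added to the input sequence.
        self.register_tokens = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim), already embedded.
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.register_tokens.expand(b, -1, -1)
        x = torch.cat([cls, reg, patch_tokens], dim=1)
        x = self.blocks(x)
        # Drop the CLS and register tokens; keep only patch tokens as features.
        return x[:, 1 + self.num_registers:, :]


# Usage example: 196 patch tokens (a 14x14 grid) of dimension 384.
model = ViTWithRegisters()
patches = torch.randn(2, 196, 384)
features = model(patches)  # shape: (2, 196, 384)
```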
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=w9pgM58H05
Changes Since Last Submission: This work is submitted as part of the MLRC Reproducibility Challenge. We have addressed the reviewers' and AE's requested changes (training from scratch with the DeiT-III Small model and providing clarifications across several sections of the paper).
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 4863