Abstract: Vision Transformers (ViTs) have achieved state-of-the-art (SOTA) performance on numerous tasks. However, the emergence of high-norm artifact tokens in supervised and self-supervised ViTs hinders the interpretability of these models' attention maps. This study reproduces and validates previous work (5) that addresses this issue through register tokens, learnable placeholders added to the input sequence, which mitigate artifacts and yield smoother feature maps. We evaluate the presence of artifacts in several ViT models, namely DeiT-III and DINOv2 architectures, and investigate the impact of fine-tuning pre-trained ViTs with register tokens and the additional regularization this introduces. Through experiments on pre-trained and fine-tuned models, we confirm that register tokens eliminate artifacts and improve attention map interpretability.
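As a minimal sketch of the idea behind register tokens (not the authors' implementation; all module and parameter names are illustrative assumptions), the learnable registers are simply extra tokens concatenated to the input sequence alongside the [CLS] token and discarded at the output:

```python
# Illustrative sketch: prepending learnable register tokens to a ViT token sequence.
import torch
import torch.nn as nn

class ViTInputWithRegisters(nn.Module):
    def __init__(self, embed_dim: int = 384, num_registers: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable register tokens: placeholders that participate in attention
        # but have no output role; their outputs are dropped after the last block.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim), e.g. from a patch-embedding layer
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        # Sequence fed to the transformer blocks: [CLS] + registers + patches.
        return torch.cat([cls, reg, patch_tokens], dim=1)
```

At the output of the network, only the [CLS] and patch tokens are used for downstream tasks; the register outputs are simply discarded.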
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: /forum?id=w9pgM58H05
Changes Since Last Submission:
- Added semantic segmentation experiments on ADE20K using DINOv2-L and DeiT-III-S to evaluate the impact of register tokens.
- Included downstream results for DeiT-III-Small with 0, 1, 2, and 4 register tokens; clarified that DINOv2 results use a fixed 4-register-token pretrained backbone.
- Added histogram plots (Figure 9) showing the reduction of artifact tokens via fine-tuning, based on the L2 norm distribution of tokens (see the detection sketch after this list).
- Expanded Section 3.5 with details on the computational requirements for the experiments.
- In Section 4.1.6, added an explanation of how local information was measured using linear classifiers trained to predict token spatial positions.
- Made minor improvements to clarity and figures.
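The histogram analysis referenced above relies on flagging high-norm tokens. A minimal sketch, under our own assumptions (the threshold value and the feature-extraction call are illustrative, not the authors' exact procedure):

```python
# Illustrative sketch: flagging high-norm artifact tokens from patch-token features.
import torch

def artifact_mask(patch_tokens: torch.Tensor, threshold: float = 150.0) -> torch.Tensor:
    """patch_tokens: (batch, num_patches, embed_dim) output features.
    Returns a boolean mask marking tokens whose L2 norm exceeds the threshold."""
    norms = patch_tokens.norm(p=2, dim=-1)  # (batch, num_patches)
    return norms > threshold

# Example usage (feature extractor call is hypothetical):
# feats = model.forward_features(images)
# print("artifact token fraction:", artifact_mask(feats).float().mean().item())
```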
Assigned Action Editor: Lu Jiang
Submission Number: 4353