On the Reproducibility of "Vision Transformers Need Registers"

TMLR Paper4382 Authors

28 Feb 2025 (modified: 10 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Training Vision Transformers (ViTs) presents significant challenges, one of which is the emergence of artifacts in attention maps that hinder interpretability. Darcet et al. (2024) investigated this phenomenon and attributed it to the need for ViTs to store global information outside the [CLS] token. They proposed a solution: adding empty input tokens, named registers, which eliminate the artifacts and improve the clarity of attention maps. In this work, we reproduce the findings of Darcet et al. (2024) and evaluate the generalizability of their claims across multiple models, including DINO, DINOv2, OpenCLIP, and DeiT3. While we confirm the validity of several of their key claims, our results reveal that some do not extend universally to other models. Additionally, we explore the impact of model size, extending their findings to smaller models. Finally, we resolve terminology inconsistencies found in the original paper and explain their impact when generalizing to a wider range of models.
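As a rough illustration of the mechanism the abstract describes, the sketch below appends learnable register tokens to a ViT's input sequence and discards them at the output. The class name `RegisterViT`, the default of four registers, and the initialization scale are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the register mechanism, assuming a generic stack of
# transformer blocks as the encoder. Illustrative only.
import torch
import torch.nn as nn

class RegisterViT(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int, n_registers: int = 4):
        super().__init__()
        self.encoder = encoder  # any ViT-style transformer encoder
        # Learnable register tokens, shared across the batch.
        # The 0.02 init scale is an assumption, not the paper's value.
        self.registers = nn.Parameter(torch.randn(1, n_registers, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 1 + n_patches, dim), i.e. [CLS] + patch tokens
        b = tokens.shape[0]
        regs = self.registers.expand(b, -1, -1)
        x = torch.cat([tokens, regs], dim=1)  # append registers to the sequence
        x = self.encoder(x)
        # Registers are discarded at the output; they only act as scratch
        # space for global information during attention.
        return x[:, : tokens.shape[1]]
```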
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=4VonR2EPrf
Changes Since Last Submission:

**Revision 1:** We have expanded and clarified several sections in response to the reviewers' feedback.

- In Section 3.3, we now include a methodology for architectures without a [CLS] token (Claims 1, 2, and 4), and we have added two missing citations.
- Section 3.4 has been augmented with descriptions of the Swin and PVT vision transformers.
- In Section 3.6, the reported $CO_2$ emissions have been updated to reflect the extra compute used for our rebuttal experiments.
- The former Section 4 has been split into two subsections: **4.1** retains the original analyses for vanilla ViTs, and **4.2** presents the newly added results for PVT and Swin models, shown separately for clarity.
- In Section 5.2, we integrate these new PVT and Swin findings with our existing conclusions.
- **Tables:** Tables 1–3 now include PVT and Swin results; Table 1's caption has been improved and bolding errors corrected; Table 2 adds the L2-distance gap between outliers and normal samples; Tables 4 and 5 are converted to a side-by-side layout to save space.
- **Figures:** Figures 2 and 3 have been expanded: Figure 2 is now landscape with two extra sample images, and Figure 3 adds PVT-b2 and Swin-Small plots. We have also introduced Figure 6 (feature-map visualizations for PVT and Swin) and Figure 7 (L2-norm distributions for PVT variants).
- **Appendix:** Appendix B offers an in-depth analysis of PVT behavior; Appendix C compiles all feature-map plots across every model.

**Revision 2:**

- Figure 1 caption: added a reference to Appendix C, which includes a more in-depth analysis of the relationship between tokens' L2-norm distribution and artifacts.
- Table 1: added a baseline column that displays the trivial solution (always predicting "middle") for the position prediction task, as per the reviewer's comments, and updated the caption.
- Fixed minor typographical errors.
- [new] Appendix C: To address the reviewer's comments, we added a new section in the Appendix and a new figure (Figure 12). Both help visualize the connection between tokens' L2-norm distribution and noise/artifacts in feature maps. The previous Appendix C is now Appendix D.
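The revision notes above repeatedly reference an analysis relating tokens' L2-norm distribution to feature-map artifacts. Below is a minimal sketch of how such per-token norms might be computed and high-norm outliers flagged; the function name `outlier_mask` and the 3-sigma threshold are hypothetical conveniences, not the paper's exact procedure.

```python
# Hedged sketch: compute per-token L2 norms of patch features and flag
# unusually high-norm tokens, a common proxy for artifact tokens.
import torch

def outlier_mask(patch_features: torch.Tensor, k: float = 3.0) -> torch.Tensor:
    """patch_features: (batch, n_patches, dim) -> boolean mask (batch, n_patches)."""
    norms = patch_features.norm(dim=-1)   # per-token L2 norm
    mu, sigma = norms.mean(), norms.std()
    return norms > mu + k * sigma         # tokens far above the mean norm
```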
Assigned Action Editor: ~Tae-Hyun_Oh3
Submission Number: 4382