Registers in Small Vision Transformers: A Reproducibility Study of "Vision Transformers Need Registers"
Abstract: Recent work has shown that Vision Transformers (ViTs) can produce "high-norm" artifact tokens in their attention maps. These artifacts disproportionately accumulate global information, can degrade performance, and reduce the interpretability of these models. Darcet et al. (2024) proposed registers—auxiliary learnable tokens—to mitigate these artifacts. In this reproducibility study, we investigate whether these improvements extend to smaller ViTs. Specifically, we examine whether high-norm tokens appear in a DeiT-III Small model, whether registers reduce these artifacts, and how registers influence local and global feature representations. Our results confirm that smaller ViTs also exhibit high-norm tokens and that registers partially alleviate them, improving interpretability. Although the overall performance gains are modest, these findings reinforce the utility of registers in enhancing ViTs while highlighting open questions about their varying effectiveness across different inputs and tasks. Our code is available at https://anonymous.4open.science/r/regs-small-vits.
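For context, the register mechanism this study reproduces can be summarized in a few lines of PyTorch. The sketch below is illustrative, not the authors' code: the module name ViTWithRegisters and all hyperparameter values are hypothetical, and it assumes patch embedding and positional encoding happen upstream. Registers are simply extra learnable tokens concatenated with the [CLS] and patch tokens before the transformer blocks and discarded at the output.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Minimal sketch of register tokens in a ViT (hypothetical, not the paper's code)."""

    def __init__(self, embed_dim=384, num_registers=4, depth=12, num_heads=6):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Registers: extra learnable tokens, identical in form to the [CLS] token.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.num_registers = num_registers

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D), already patch-embedded and position-encoded
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        regs = self.registers.expand(B, -1, -1)
        # Concatenate [CLS] + registers + patch tokens before the transformer blocks.
        x = torch.cat([cls, regs, patch_tokens], dim=1)
        x = self.blocks(x)
        # Registers are discarded at the output; only [CLS] and patch tokens are used.
        cls_out = x[:, 0]
        patch_out = x[:, 1 + self.num_registers:]
        return cls_out, patch_out
```

Because the registers are dropped at the output, they add no downstream parameters; the hypothesis in Darcet et al. (2024) is that they give the model dedicated slots for the global computation that would otherwise be stored in high-norm patch tokens.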
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Results of DeiT-III Small models with 2, 8, and 16 register tokens were added to Sections 4.2.3 and 4.3.1. Appendix E was updated to accommodate the additional compute usage.
Results of DINOv2 Small models with and without register tokens were added to Table 1, as well as Figure 1.
The results in Table 2 were supplemented with the standard deviation over the different test runs.
Section 5.1 "Limitations" was added.
A discussion of the importance of studying high-norm tokens in small ViTs was added to the introduction.
High- and low-norm artifacts are now defined in Section 1.1.
Assigned Action Editor: ~Xinlei_Chen1
Submission Number: 4375