['3c3', '< Abstract: Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.', '---', '> Abstract: Vision Transformers (ViTs) have demonstrated remarkable capabilities in visual representation learning. However, we identify and characterize pervasive "artifact" tokens—high-norm features frequently appearing in uninformative background regions—which models repurpose for internal computations. These artifacts degrade performance and interpretability. We introduce a novel and simple solution: "Registers," dedicated learnable tokens appended to the input sequence. This effective approach entirely eliminates these artifacts across supervised and self-supervised ViTs. Our method achieves new state-of-the-art results for self-supervised dense prediction, significantly enhances object discovery, and produces demonstrably smoother, more interpretable feature and attention maps, improving ViT utility for diverse downstream tasks.', '5c5', '< Section: ', '---', '> Section: REGISTERS FOR VISION TRANSFORMERS', '9,15c9', '< Embedding images into generic features that can serve multiple purposes in computer vision has been a long-standing problem. 
First methods relied on handcrafted principles, such as SIFT (Lowe, 2004), before the scale of data and deep learning techniques allowed for end-to-end training. Pursuing generic feature embeddings is still relevant today, as collecting valuable annotated data for many specific tasks remains difficult. This difficulty arises because of the required expertise (e.g., medical data, or remote sensing) or the cost at scale. Today, it is common to pretrain a model for a task for which plenty of data is available and extract a subset of the model to use as a feature extractor. [Figure caption: We consider ViTs trained with label supervision (DeiT-III), text supervision (OpenCLIP), or self-supervision (DINO and DINOv2). Interestingly, all models but DINO exhibit peaky outlier values in the attention maps. The goal of this work is to understand and mitigate this phenomenon.] Multiple approaches offer this possibility; supervised methods, building on classification', '< or text-image alignment, allow training strong feature models to unlock downstream tasks. Alternatively, self-supervised methods building on the Transformer architecture have attracted significant attention due to their high prediction performance on downstream tasks and the intriguing ability of some models to provide unsupervised segmentations (Caron et al., 2021). In particular, the DINO algorithm is shown to produce models that contain explicit information about the semantic layout of an image. Indeed, qualitative results show that the last attention layer naturally focuses on semantically consistent parts of images and often produces interpretable attention maps. Exploiting these properties, object discovery algorithms such as LOST (Siméoni et al., 2021) build on top of DINO. Such algorithms can detect objects without supervision by gathering information in attention maps. 
They are effectively unlocking a new frontier in computer vision.', '< DINOv2 (Oquab et al., 2023), a follow-up to DINO, provides features that allow tackling dense prediction tasks. DINOv2 features lead to successful monocular depth estimation and semantic segmentation with a frozen backbone and linear models. Despite the strong performance on dense tasks, we observed that DINOv2 is surprisingly incompatible with LOST. When used to extract features, it delivers disappointing performance, only on par with supervised alternative backbones in this scenario. This suggests that DINOv2 behaves differently than DINO. The investigation described in this work notably exposes the presence of artifacts in the feature maps of DINOv2 that were not present in the first version of this model. These are observable qualitatively using straightforward methods. Also surprisingly, applying the same observations to supervised vision transformers exposes similar artifacts, as shown in Fig. 2. This suggests that DINO is, in fact, an exception, while DINOv2 models match the baseline behavior of vision transformers.', '< In this work, we set out to better understand this phenomenon and develop methods to detect these artifacts. We observe that they are tokens with roughly 10x higher norm at the output and correspond to a small fraction of the total sequence (around 2%). We also show that these tokens appear around the middle layers of the vision transformer, and that they only appear after a sufficiently long training of a sufficiently big transformer. In particular, we show that these outlier tokens appear in patches similar to their neighbors, meaning patches that convey little additional information.', '< As part of our investigation, we evaluate the outlier tokens with simple linear models to understand the information they contain. 
We observe that, compared to non-outlier tokens, they hold less information about their original position in the image or the original pixels in their patch. [Figure caption: We observe that DINOv2 has a few outlier patches, whereas DINO does not present these artifacts. For DINOv2, although most patch tokens have a norm between 0 and 100, a small proportion of tokens have a very high norm. We measure the proportion of tokens with norm larger than 150 at 2.37%.] This observation', '< suggests that the model discards the local information contained in these patches during inference. On the other hand, learning an image classifier on outlier patches yields significantly stronger accuracy than doing so on the other patches, suggesting that they contain global information about the image. We propose the following interpretation of these elements: the model learns to recognize patches containing little useful information, and to recycle the corresponding tokens to aggregate global image information while discarding spatial information.', '< This interpretation is consistent with an inner mechanism in transformer models that allows performing computations within a restricted set of tokens. In order to test this hypothesis, we append additional tokens, which we call registers, to the token sequence, independent of the input image. We train several models with and without this modification and observe that the outlier tokens disappear from the sequence entirely. As a result, the performance of the models increases in dense prediction tasks, and the resulting feature maps are significantly smoother. These smooth feature maps enable object discovery methods such as LOST, mentioned above, to work with the updated models.', '---', '> The pursuit of generic, multipurpose image features is a cornerstone of computer vision, driven by the high cost and specialized expertise often required for collecting annotated data in various domains (e.g., medical imaging, remote sensing). 
Historically, this quest evolved from handcrafted features like SIFT (Lowe, 2004) to end-to-end deep learning techniques. Modern approaches often involve pretraining models on large datasets, then extracting features for diverse downstream tasks. Both supervised methods, leveraging classification or text-image alignment, and self-supervised methods, particularly those based on the Transformer architecture, have demonstrated remarkable success in this paradigm.', '16a11,20', '> Self-supervised Vision Transformers (ViTs) have garnered significant attention due to their strong performance and the unexpected ability of some models, like DINO (Caron et al., 2021), to produce unsupervised segmentations. DINO models are known for generating interpretable attention maps that naturally highlight semantically consistent image regions. This property has been instrumental for object discovery algorithms such as LOST (Siméoni et al., 2021), which exploit attention maps to detect objects without explicit supervision, thereby opening new avenues in computer vision.', '> ', '> However, a follow-up, DINOv2 (Oquab et al., 2023), while excelling in dense prediction tasks like monocular depth estimation and semantic segmentation with frozen backbones, surprisingly exhibits limitations when integrated with object discovery methods like LOST. Its performance with LOST is merely on par with supervised alternatives, suggesting a fundamental behavioral difference from DINO. Our investigation reveals the presence of pervasive "artifact" tokens in DINOv2\'s feature maps—a phenomenon largely absent in the original DINO. Moreover, we demonstrate that similar artifacts are prevalent in supervised ViTs, as illustrated in Fig. 2, suggesting that DINO\'s clean attention maps are an exception rather than the norm among Vision Transformers.', '> ', '> This work aims to thoroughly understand and mitigate these artifact tokens. 
We characterize them as high-norm outliers (roughly 10x higher than regular tokens), constituting a small fraction (around 2%) of the total sequence. These tokens emerge in the middle layers of the ViT, specifically after sufficient training of larger models. Crucially, they appear in patches that are highly similar to their neighbors, indicating redundant local information.', '> ', '> Further probing reveals that these outlier tokens contain significantly less local information (e.g., about their original position or pixel values) compared to non-outlier tokens, implying the model discards local patch specifics during inference. Conversely, these outlier tokens are highly predictive of global image properties, yielding stronger image classification accuracy. This leads to our central interpretation: the model learns to identify and repurpose redundant local patches into "registers" for aggregating global image information, effectively discarding spatial context.', '> ', '> This interpretation aligns with the internal computational flexibility of Transformer models. To test this hypothesis and provide a solution, we introduce explicit, learnable "register" tokens to the input sequence, independent of the image content. Training models with these registers demonstrates their effectiveness: the artifact tokens disappear entirely from the patch sequence. This architectural modification not only improves performance on dense prediction tasks but also yields significantly smoother feature maps, thereby enabling object discovery methods like LOST to function effectively with these enhanced models.', '> ', '237d240', '< ']
