ViT Registers and Fractal ViT

ICLR 2025 Workshop ICBINB Submission 19 Authors

05 Feb 2025 (modified: 05 Mar 2025) · Submitted to ICLR 2025 Workshop ICBINB · CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: ViT, positional encoding, ImageNet, registers
TL;DR: Register-like tokens and a self-similar attention mask do not appear to enhance positional encoding or improve the performance of a ViT baseline.
Abstract: Drawing inspiration from recent findings, including the surprisingly decent performance of transformers without positional encoding (NoPE) in language modeling and the observation that registers (additional throwaway tokens not tied to the input) can improve the performance of large vision transformers (ViTs), we propose and test a ViT variant called fractal ViT. It breaks permutation invariance among the tokens by applying an attention mask between the regular tokens and register-like "summary tokens", either in isolation or in combination with various positional encodings. These models do not improve upon the baseline's performance, suggesting that the original findings may be scale-, domain-, or application-specific.
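
The paper's exact mask construction is not reproduced on this page, so the following is a minimal, purely hypothetical sketch of the general idea: register-like "summary tokens" are appended to the patch tokens, and a structured attention mask makes each summary token responsible for a contiguous block of patches, which breaks permutation invariance among the tokens. The function name `build_summary_mask`, the token counts, the single level of grouping (rather than a fully self-similar hierarchy), and the boolean mask convention are all illustrative assumptions, not the authors' design. Assumes PyTorch >= 2.0.

```python
# Hypothetical sketch, NOT the authors' code: register-like summary tokens
# plus a block-structured attention mask over [patch | summary] tokens.
import torch
import torch.nn.functional as F

def build_summary_mask(num_patches: int, num_summary: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed) over [patches | summary] tokens.

    Patch tokens attend to all patch tokens; each summary token attends only
    to its own contiguous block of patches (and itself), and those patches
    attend back to it. Because the blocks depend on token order, permutation
    invariance among the patch tokens is broken without positional encoding.
    Assumes num_patches is divisible by num_summary (illustrative choice).
    """
    n = num_patches + num_summary
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_patches, :num_patches] = True            # patch <-> patch
    block = num_patches // num_summary
    for s in range(num_summary):
        lo, hi = s * block, (s + 1) * block
        row = num_patches + s
        mask[row, lo:hi] = True                        # summary -> its patch block
        mask[lo:hi, row] = True                        # patches -> their summary
        mask[row, row] = True                          # summary self-attention
    return mask

# Toy usage: 16 patch tokens, 4 summary tokens, one masked attention call.
patches, summaries, dim = 16, 4, 32
x = torch.randn(1, patches + summaries, dim)
mask = build_summary_mask(patches, summaries)          # (20, 20), broadcast over batch
out = F.scaled_dot_product_attention(x, x, x, attn_mask=mask)
```

A multi-level ("fractal") variant would presumably repeat this grouping recursively, with higher-level summary tokens pooling lower-level ones; the one-level block mask above is only the simplest instance of the idea.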
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 19