Alias-Free ViT: Fractional Shift Invariance via Linear Attention

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Vision Transformers, alias‑free, aliasing, anti‑aliasing, shift invariance, fractional shifts, linear attention, cross‑covariance attention, translation robustness, ImageNet, XCiT
TL;DR: We introduce an alias‑free ViT that combines anti‑aliasing with linear cross‑covariance attention to achieve fractional shift invariance, delivering ~99% consistency to sub‑pixel shifts and stronger translation robustness with competitive accuracy.
Abstract: Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may limit their performance. Specifically, Vision Transformers (ViTs) are not translation‑invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift‑invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti‑aliasing approaches have been proposed to certify the translation robustness of convnets. Building on this line of work, we propose an Alias‑Free ViT, which combines two main components. First, it uses alias‑free downsampling and nonlinearities. Second, it uses linear cross‑covariance attention that is shift‑equivariant to both integer and fractional translations, enabling a shift‑invariant global representation. Our model maintains competitive performance in image classification and outperforms similarly sized models in robustness to adversarial translations.
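The abstract's second component refers to linear cross‑covariance attention in the style of XCiT. Below is a minimal, hedged sketch of such an attention layer (not the authors' code); the module name, temperature parameter, and normalization choices follow the publicly described XCiT formulation and may differ in detail from the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    """Attention over feature channels (d x d map) instead of tokens (N x N),
    so the cost is linear in the number of tokens N."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        # Learnable per-head temperature, as in the XCiT description.
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, N, dim)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        # Each of q, k, v: (B, heads, head_dim, N)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)
        # L2-normalize along the token axis. The channel-channel attention map
        # q @ k^T is a sum of per-token outer products, so it is unchanged by
        # reorderings of the tokens; this is what makes the global mixing
        # compatible with shift-equivariant token features.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # (B, heads, d, d)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).permute(0, 3, 1, 2).reshape(B, N, C)
        return self.proj(out)
```

A quick usage check: `CrossCovarianceAttention(dim=192, num_heads=4)(torch.randn(2, 196, 192))` returns a tensor of shape `(2, 196, 192)`. Note this sketch covers only the linear-attention component; the paper's alias‑free downsampling and nonlinearities are separate ingredients not shown here.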
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 10877