Keywords: Vision Transformers, Computer Vision
TL;DR: We modulate patch embeddings from pre-trained ViTs using their relative distances to one another, improving performance on downstream tasks such as classification.
Abstract: Positional encodings in Vision Transformers, whether relative (iRPE, RoPE) or otherwise, help the model reason about space but remain content-agnostic. We introduce a lightweight, content-aware patch modulation that injects a quasi-positional prior computed from pre-trained patch embeddings. We present two drop-in pre-MHSA modules: RADAR (anchor-conditioned distance priors that modulate tokens) and PFIM (parameter-free importance scaling that adds no new trainable parameters beyond the logit layer). Both keep the ViT backbone frozen, preserve the attention kernel, and add negligible overhead.
On CIFAR-100 with absolute positional encoding, RADAR boosts Top-1 accuracy by +7.5 pp and Top-5 by +3.3 pp over a vanilla ViT, and by +4.1 pp / +1.6 pp over a strong single-CPE baseline. PFIM improves the vanilla ViT by +2.0 pp (Top-1) and +1.1 pp (Top-5), performing on par with Single-PEG within a small margin. Improvements are statistically significant across seeds (paired t-test, 95% CI). On CIFAR-100, RADAR has 56% fewer and PFIM 88% fewer trainable parameters than Single-PEG. By turning latent patch geometry into content-aware priors, our approach reallocates attention to semantically relevant regions, offering parameter-efficient gains well suited to low-budget training.
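To make the general idea concrete, the sketch below is a minimal PyTorch illustration of a parameter-free, content-aware pre-MHSA modulation driven by embedding-space distances. It is not the submission's RADAR or PFIM implementation: the anchor-selection heuristic (largest-norm tokens), the Euclidean distance measure, and the softmax temperature are all assumptions made for illustration.

```python
# Hypothetical sketch (not the authors' code): content-aware scaling of frozen
# ViT patch tokens before MHSA, using distances between pre-trained patch
# embeddings as a quasi-positional prior. Adds no trainable parameters.
import torch
import torch.nn as nn


class DistancePriorModulation(nn.Module):
    """Scales each patch token by an importance weight derived from its
    mean embedding-space distance to a small set of anchor tokens."""

    def __init__(self, num_anchors: int = 4, temperature: float = 1.0):
        super().__init__()
        self.num_anchors = num_anchors
        self.temperature = temperature  # fixed hyperparameter, not learned

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim) patch embeddings from a frozen ViT
        b, n, d = tokens.shape
        # Pick anchors as the tokens with the largest norm (one simple heuristic).
        norms = tokens.norm(dim=-1)                                  # (b, n)
        anchor_idx = norms.topk(self.num_anchors, dim=1).indices     # (b, k)
        anchors = torch.gather(
            tokens, 1, anchor_idx.unsqueeze(-1).expand(-1, -1, d)
        )                                                            # (b, k, d)
        # Mean Euclidean distance from each patch to the anchors.
        dist = torch.cdist(tokens, anchors).mean(dim=-1)             # (b, n)
        # Patches closer to the anchors get larger weights; the softmax keeps
        # the scale bounded, and multiplying by n preserves the average magnitude.
        weights = torch.softmax(-dist / self.temperature, dim=-1) * n
        return tokens * weights.unsqueeze(-1)                        # to MHSA
```

Usage would amount to inserting the module between the frozen patch-embedding stage and the first attention block, e.g. `tokens = DistancePriorModulation()(tokens)`, leaving the attention kernel itself untouched.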
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20987