Keywords: space filling curves, ViT, spatial priors
TL;DR: A new attention mechanism for vision backbones that uses Space Filling Curves to improve both fine-tuning and pre-training of ViTs.
Abstract: Vision Transformers (ViTs) have become a dominant backbone in computer vision, yet their attention mechanism lacks inherent spatial inductive biases, which are especially crucial in small models and low-data regimes. Inspired by the masking in Linear Transformers and the scanning patterns of Vision SSMs, we propose VIOLIN, a lightweight masked attention mechanism that integrates Space Filling Curves (SFCs) to enhance spatial awareness with negligible computational overhead. VIOLIN scans the input image with multiple SFCs to build curve-specific decay masks, which are averaged and multiplied with the attention matrix to encode spatial relationships. It yields notable gains in data-scarce settings: when fine-tuning on VTAB-1K, VIOLIN improves accuracy by up to 8.7% on the Structured group, and it can be combined with parameter-efficient tuning methods such as LoRA. Beyond fine-tuning, VIOLIN consistently improves various tiny- and small-scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K, achieving gains of up to 0.9% on ImageNet-1K and 7.2% on pixel-level CIFAR-100. Overall, VIOLIN offers a computationally efficient yet effective way to inject spatial inductive bias into ViTs, particularly benefiting small models and data-scarce scenarios.
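To make the mechanism described in the abstract concrete, here is a minimal sketch of SFC decay-masked attention based only on that description. The decay rate `gamma`, the use of simple row-major and boustrophedon (snake) scans as stand-ins for the paper's actual space-filling curves, and the choice to multiply the mask into post-softmax attention weights and renormalize are all assumptions, not details from the paper.

```python
# Sketch of SFC decay-masked attention, reconstructed from the abstract alone.
# Assumptions (not from the paper): fixed decay rate `gamma`; row-major and
# snake scans as stand-in curves; mask applied to post-softmax weights.
import torch
import torch.nn.functional as F

def scan_ranks(h: int, w: int) -> torch.Tensor:
    """Return (num_curves, h*w) ranks of each grid token under each scan."""
    idx = torch.arange(h * w).reshape(h, w)
    row_major = idx.flatten()                      # plain raster scan
    snake = idx.clone()
    snake[1::2] = snake[1::2].flip(-1)             # boustrophedon (snake) scan
    snake = snake.flatten()
    ranks = torch.empty(2, h * w, dtype=torch.long)
    ranks[0][row_major] = torch.arange(h * w)      # token -> position in scan 0
    ranks[1][snake] = torch.arange(h * w)          # token -> position in scan 1
    return ranks

def sfc_decay_mask(h: int, w: int, gamma: float = 0.95) -> torch.Tensor:
    """Average gamma^|rank_i - rank_j| over the curves -> (h*w, h*w) mask."""
    ranks = scan_ranks(h, w).float()
    dist = (ranks[:, :, None] - ranks[:, None, :]).abs()  # (curves, N, N)
    return (gamma ** dist).mean(dim=0)

def violin_attention(q, k, v, mask):
    """Standard single-head attention with the decay mask folded in."""
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    attn = attn * mask                             # inject spatial prior
    attn = attn / attn.sum(dim=-1, keepdim=True)   # renormalize rows
    return attn @ v

# Usage on a 14x14 token grid (ViT-style), single head, head dim 64:
h = w = 14; d = 64
q, k, v = (torch.randn(h * w, d) for _ in range(3))
out = violin_attention(q, k, v, sfc_decay_mask(h, w))
print(out.shape)  # torch.Size([196, 64])
```

Since the mask depends only on the grid size and the chosen curves, it can be precomputed once and reused across layers and batches, which is consistent with the abstract's claim of negligible computational overhead.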
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 20303