SPoT: Subpixel Placement of Tokens in Vision Transformers

TMLR Paper6378 Authors

04 Nov 2025 (modified: 09 Mar 2026) | Decision pending for TMLR | CC BY 4.0

Abstract: Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
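To make the idea of continuous token placement concrete, the following is a minimal sketch of how a patch could be read out at a non-integer center using bilinear interpolation. It is an illustration only, not the authors' implementation: the function names `bilinear_sample` and `extract_patch`, the grayscale nested-list image, and the choice of bilinear interpolation are all assumptions made for this example.

```python
import math

def bilinear_sample(image, y, x):
    """Read a grayscale image (list of rows) at a float coordinate (y, x)."""
    h, w = len(image), len(image[0])
    # Clamp to the image bounds so that border samples stay defined.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1 - wy) * top + wy * bottom

def extract_patch(image, center_y, center_x, patch_size):
    """Hypothetical helper: sample a patch_size x patch_size window
    centred at a subpixel position instead of a grid cell."""
    half = (patch_size - 1) / 2.0
    return [
        [bilinear_sample(image, center_y + i - half, center_x + j - half)
         for j in range(patch_size)]
        for i in range(patch_size)
    ]

# Toy image whose value at (row, col) is 10 * row + col.
image = [[float(10 * r + c) for c in range(8)] for r in range(8)]
patch = extract_patch(image, 3.25, 4.5, 2)
```

Because the toy image is linear in its coordinates, each entry of `patch` equals `10 * y + x` at the sampled location, which makes the subpixel offsets easy to check by hand. A grid-based tokenizer would instead be restricted to windows aligned with integer multiples of the patch size.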

Submission Type: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: 6169

Changes Since Last Submission: In the final camera-ready submission we have addressed the concerns raised by the AE. Specifically:

> The authors should clarify how the model parameters $\theta$ are optimized when using SPoT.

We thank the AE for the suggestion. We have clarified the distinction between the optimization of the parameters $\theta$ performed in retrofitting and the optimization of positions $S$ performed with SPoT-ON.

---

Explicitly,

- **Sec. 2, page 3**: We specify that the SFS objective (Eq. 1) is to find optimal positions on a per-image basis, i.e. positions are derived per image.
- **Sec. 3.2, page 4**: We specify that the optimization of positions with SPoT-ON is performed with independent positions per image. To avoid ambiguity, we emphasize this in multiple locations in this subsection.
- **Sec. 5.2, pages 10-11**: We outline more explicitly the retrofitting step in which we optimize the model parameters $\theta$. We pose the optimization objective with an explicit expectation over images in the dataset, and outline the role of the spatial positions $S$ as sampled subpixel positions over each image.

---

In addition to the corrections based on reviewer feedback in the rebuttal, we hope this resolves the remaining ambiguities. We thank the AE and all reviewers for a very productive rebuttal, full of insights, that has helped us communicate our main ideas with clarity.

Code: https://github.com/dsb-ifi/SPoT

Assigned Action Editor: ~Hankook_Lee1

Submission Number: 6378
