Keywords: Vision Transformers, Test-time Scaling, Token-level Attention, Attention Budget Scheduling, Efficient Inference, CIFAR-100, CLEVR, Model Calibration, Robustness
TL;DR: Post-hoc token-level scaling for Vision Transformers improves accuracy and calibration with minimal extra computation.
Abstract: Test-time scaling enables vision models to improve inference performance without retraining by selectively allocating extra computation. Existing methods typically scale computation uniformly—via higher-resolution inputs, multi-crop ensembles, or extra sampling steps—and thus ignore spatial redundancy. We introduce Attention Budget Scheduling (ABS), a token-level test-time scaling method for Vision Transformers (ViTs) that reallocates attention computation toward uncertain or high-saliency tokens while leaving less informative regions unchanged. ABS operates post-hoc and requires no retraining. Evaluations on CIFAR-100 and CLEVR show modest but consistent improvements: ABS achieves up to 1.21\% higher accuracy on CIFAR-100 with only 10\% additional FLOPs, whereas resolution scaling requires 69\% more FLOPs for a 0.77\% gain; ABS also improves calibration. These results highlight token-level scaling as an efficient and practical approach for enhancing ViT inference.
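The sketch below illustrates the general idea described in the abstract: spend a small extra compute budget only on the most uncertain or salient tokens and leave the rest untouched. It is not the authors' implementation; the function name `refine_uncertain_tokens`, the `extra_block` module, and the 10\% budget are illustrative assumptions.

```python
# Minimal sketch of token-level test-time compute reallocation for a ViT.
# Hypothetical example; not the ABS implementation from the submission.
import torch
import torch.nn as nn

def refine_uncertain_tokens(tokens, saliency, extra_block, budget=0.10):
    """Re-process only the highest-saliency tokens with one extra
    transformer block, leaving the remaining tokens unchanged.

    tokens:      (B, N, D) token embeddings from the backbone
    saliency:    (B, N) per-token saliency / uncertainty scores
    extra_block: nn.Module mapping (B, k, D) -> (B, k, D)
    budget:      fraction of tokens that receive extra computation
    """
    B, N, D = tokens.shape
    k = max(1, int(budget * N))
    top_idx = saliency.topk(k, dim=1).indices              # (B, k) most salient tokens
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, D)   # (B, k, D)
    selected = tokens.gather(1, gather_idx)                # gather selected tokens
    refined = extra_block(selected)                        # extra compute spent here only
    out = tokens.clone()
    out.scatter_(1, gather_idx, refined)                   # write refined tokens back
    return out

# Hypothetical usage: one extra attention block as the additional budget.
extra_block = nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True)
tokens = torch.randn(8, 197, 192)    # e.g. ViT-Tiny: 14x14 patches + CLS token
saliency = torch.rand(8, 197)        # e.g. CLS-attention or predictive-entropy scores
refined = refine_uncertain_tokens(tokens, saliency, extra_block)
```

In this sketch the extra FLOPs scale with the budget fraction rather than with image resolution, which is the contrast the abstract draws against uniform resolution scaling.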
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 1