Understanding the Limits of Vision Test-Time Scaling: Path Redundancy, Instance Difficulty, and Adaptive Compute
Keywords: Vision Test-Time Scaling, Test-Time Compute, Multi-Path Inference, CLIP, Zero-Shot Classification, Adaptive Inference, Path Diversity, Inference Redundancy, Compute-Accuracy Trade-offs, Information Scaling
TL;DR: Vision test-time scaling improves accuracy only when additional inference paths provide diverse information; otherwise, high path redundancy causes rapid saturation.
Abstract: Test-time scaling has shown strong gains in language reasoning, yet its behavior in vision remains poorly understood. We present one of the first systematic studies of vision test-time scaling through CLIP-based multi-path inference, where computation is increased via prompt ensembles and test-time augmentations. Our results show that additional inference paths improve accuracy in early regimes but rapidly exhibit diminishing returns. Through correlation analysis, we demonstrate that strong path redundancy limits the effective value of additional computation. We further show that compute gains concentrate on high-uncertainty samples, motivating adaptive inference strategies. Although entropy-based adaptive stopping approaches favorable compute-accuracy trade-offs, our analysis reveals substantial remaining efficiency headroom. Overall, our findings suggest that the primary bottleneck of vision test-time scaling is not computation itself, but the lack of informational diversity across inference paths.
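The two mechanisms named in the abstract, entropy-based adaptive stopping and path-correlation analysis, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the entropy threshold of 0.5 nats, and the assumption that each inference path (a prompt or augmentation) yields a class-probability vector are all hypothetical choices made for illustration. No CLIP model is called here; the sketch operates on precomputed per-path probabilities.

```python
import numpy as np


def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability vector."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return float(-(p * np.log(p)).sum())


def adaptive_predict(path_probs, threshold=0.5, min_paths=1):
    """Average paths one at a time; stop once the running mean is confident.

    path_probs: (K, C) array, one class distribution per inference path
                (assumed precomputed, e.g. one row per prompt/augmentation).
    threshold:  hypothetical entropy cutoff; lower means stricter stopping.
    Returns (predicted_class, number_of_paths_used).
    """
    path_probs = np.asarray(path_probs, dtype=float)
    if path_probs.ndim != 2 or path_probs.shape[0] == 0:
        raise ValueError("path_probs must be a non-empty (K, C) array")

    running_sum = np.zeros(path_probs.shape[1])
    for k, probs in enumerate(path_probs, start=1):
        running_sum += probs
        mean = running_sum / k
        # Stop early when the ensemble so far is already low-entropy.
        if k >= min_paths and entropy(mean) <= threshold:
            break
    return int(mean.argmax()), k


def mean_pairwise_correlation(path_probs):
    """Mean off-diagonal Pearson correlation between path outputs.

    Values near 1 indicate redundant paths that add little new information.
    """
    corr = np.corrcoef(np.asarray(path_probs, dtype=float))
    k = corr.shape[0]
    return float(corr[~np.eye(k, dtype=bool)].mean())


# Illustrative inputs (made up, not results from the paper).
easy_sample = [[0.98, 0.01, 0.01]] * 4
hard_sample = [
    [0.40, 0.35, 0.25],
    [0.30, 0.45, 0.25],
    [0.20, 0.60, 0.20],
    [0.10, 0.80, 0.10],
]
easy_label, easy_paths = adaptive_predict(easy_sample)
hard_label, hard_paths = adaptive_predict(hard_sample)
```

Under these assumptions, a confident sample exits after its first path while an uncertain one consumes the full budget, which mirrors the abstract's observation that compute gains concentrate on high-uncertainty samples. The correlation helper gives a rough redundancy measure for a set of paths; how the paper actually measures correlation is not specified in the abstract.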
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 2