Keywords: vision-language model, test-time compute, ensemble learning, visual reasoning
Abstract: Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models, but their applicability to vision–language models (VLMs) remains unclear.
We present a systematic study of TTC for visual reasoning across seven open-source VLMs and six benchmarks, revisiting two paradigms: (i) feature-based scoring of chain-of-thought (CoT) traces and (ii) confidence-based aggregation via majority voting (MV).
In the single-model setting, feature cues (e.g., length, pivot words) fail to improve accuracy, while MV yields only modest, CoT-dependent gains.
To explain this limitation, we theoretically show that the voting method's effectiveness depends on prediction diversity: when outputs are highly correlated, the benefit of voting vanishes.
In contrast, multi-model ensembles introduce stronger diversity through differences in architecture, training data, and scale, making them both more realistic and more promising for TTC.
However, MV treats all models equally, leaving it vulnerable to correlated errors from weaker models.
To address this, we propose Entropy-based TTC, which selects the most confident prediction based on predictive entropy.
Our method reduces to MV in the single-model case but, in ensembles, leverages confidence disparities to prioritize stronger models.
We prove that our method outperforms MV under mild dependence assumptions, and show empirically that it consistently surpasses both MV and the best individual model across diverse visual reasoning benchmarks.
This demonstrates that smaller models can enhance, rather than hinder, larger ones when combined appropriately, unlocking synergistic gains not achievable with existing TTC strategies.
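Illustrative sketch (not the authors' released code): the snippet below contrasts plain majority voting with the confidence-based selection described in the abstract, where the ensemble member with the lowest predictive entropy is trusted. All names (`majority_vote`, `select_by_entropy`, the toy answers and distributions) are hypothetical and for illustration only.

```python
# Minimal sketch of majority voting vs. entropy-based selection over an ensemble.
import math
from collections import Counter
from typing import Sequence

def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; every model's vote counts equally."""
    return Counter(answers).most_common(1)[0][0]

def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of one model's distribution over candidate answers."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_by_entropy(answers: Sequence[str],
                      prob_dists: Sequence[Sequence[float]]) -> str:
    """Pick the answer of the model whose predictive distribution is most
    confident (lowest entropy), letting stronger models dominate the ensemble."""
    best = min(range(len(answers)),
               key=lambda i: predictive_entropy(prob_dists[i]))
    return answers[best]

# Toy ensemble: three models answer the same visual-reasoning question.
answers = ["cat", "dog", "cat"]
prob_dists = [
    [0.55, 0.45],  # weak model, nearly uniform -> high entropy
    [0.95, 0.05],  # confident model -> low entropy
    [0.60, 0.40],  # another weak, correlated model
]
print(majority_vote(answers))                   # "cat": two correlated weak models win
print(select_by_entropy(answers, prob_dists))   # "dog": the confident model is trusted
```

With a single model, all sampled answers come from the same distribution, so entropy-based selection and majority voting coincide; the two diverge only when the ensemble members' confidences differ, which is the regime the abstract targets.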
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20924