Keywords: Vision-Language Models, ensemble methods, family bias, error correlation, VQA
Abstract: Ensembling Vision-Language Models (VLMs) maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5–3.6 independent voters and create a Misleading tier (1.5–6.5% of questions) where calibrated voting collapses to 0% accuracy even though the best model is correct. We propose three family-aware methods: Hierarchical Family Voting (HFV) recovers +18–26 pp on the Misleading tier; QualRCCV, a training-free method, is the first to beat calibrated voting on all three benchmarks (p<0.05); Learned Candidate Scoring (LCS) achieves +0.68% on VQAv2, +0.61% on TextVQA, and +2.45% on GQA, all significant, and is the only learned method that never degrades any benchmark.
Submission Number: 3