Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

Zacharie Bugaud

Published: 24 Apr 2026, Last Modified: 01 Jun 2026, VisCon 2026 Poster, CC BY 4.0

Keywords: Vision-Language Models, ensemble methods, family bias, error correlation, VQA

Abstract: Ensembling Vision-Language Models (VLMs) maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce the effective ensemble dimensionality to 2.5–3.6 independent voters and create a Misleading tier (1.5–6.5% of questions) on which calibrated voting collapses to 0% accuracy even though the best model is correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) recovers +18–26 pp on the Misleading tier. QualRCCV, a training-free method, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) achieves +0.68% on VQAv2, +0.61% on TextVQA, and +2.45% on GQA, all significant, and is the only learned method that never degrades any benchmark.
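The family-bias effect described in the abstract can be illustrated with a toy sketch. The code below is not the paper's HFV implementation, whose details are not given here. It only contrasts a flat plurality vote with a simple two-level vote in which each family casts a single ballot. The family names, answers, and function names are all hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Flat plurality vote; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def family_vote(answers_by_family):
    """Two-level vote (illustrative assumption, not the paper's HFV):
    each family picks its own plurality answer, then every family
    casts exactly one ballot in the final vote."""
    family_picks = [majority_vote(a) for a in answers_by_family.values()]
    return majority_vote(family_picks)


# Hypothetical toy question: four near-clone models from one family
# share the same wrong answer, while two other families disagree with them.
answers_by_family = {
    "family_a": ["cat", "cat", "cat", "cat"],
    "family_b": ["dog"],
    "family_c": ["dog"],
}

all_answers = [a for fam in answers_by_family.values() for a in fam]
flat = majority_vote(all_answers)        # the large family dominates
hier = family_vote(answers_by_family)    # one ballot per family

print("flat vote:", flat)
print("family-level vote:", hier)
```

In this toy case the flat vote follows the four correlated models, while the family-level vote counts them as one voter. This is the intuition behind an ensemble of 17 models behaving like only a few independent voters.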

Submission Number: 3
