Keywords: Unified Multi-round Counterfactual Inference, Robustness, Bias
TL;DR: We propose a Unified Multi-round Counterfactual Inference framework together with a diagnostic benchmark to improve the robustness of Large Vision-Language Models in real-world downstream tasks.
Abstract: Integrating Large Language Models into vision-language frameworks has led to the rise of powerful Large Vision-Language Models (LVLMs). However, this integration introduces two critical robustness challenges: language bias and language sensitivity. To address these issues, we propose a Unified Multi-round Counterfactual Inference (UMCI) framework, which generalizes and extends prior methods such as Counterfactual VQA and Visual Contrastive Decoding. UMCI performs multiple rounds of counterfactual inference with both textual and visual perturbations to mitigate bias and enhance consistency. This process reveals a novel test-time scaling law: increasing the number of counterfactual rounds consistently improves robustness. We further observe that non-robust samples vary across LVLMs. To disentangle the effect of the proposed inference algorithm from confounds introduced by the base models, we introduce the dynamic Bias and Sensitivity Benchmark (BS Benchmark), an adaptive evaluation tool that probes the robustness issues specific to each LVLM. Our experiments demonstrate that UMCI significantly improves robustness on the BS Benchmark while matching or improving performance on standard benchmarks such as MMBench-CN/EN, MME, MMStar, CCBench, and ViLP. Extensive experimental results indicate that UMCI is scalable and generalizable, offering a promising path toward robust multimodal reasoning.
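Illustrative sketch: the following minimal Python example conveys the general idea of multi-round counterfactual inference described above, i.e., contrasting factual predictions with predictions under textual and visual perturbations and averaging over rounds. The function names (`lvlm_logits`, `perturb_image`, `perturb_text`), the toy scoring model, and the specific aggregation rule are hypothetical assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of multi-round counterfactual contrastive decoding.
# All names and the aggregation rule below are illustrative assumptions,
# not the authors' actual UMCI implementation.
import numpy as np

def lvlm_logits(image, text):
    """Hypothetical stand-in for an LVLM's answer logits given (image, text)."""
    rng = np.random.default_rng(abs(hash((image, text))) % (2**32))
    return rng.normal(size=5)  # toy answer space of 5 candidates

def perturb_image(image, round_idx):
    """Counterfactual visual input, e.g. a blanked or noised image."""
    return f"{image}::visual_cf_{round_idx}"

def perturb_text(text, round_idx):
    """Counterfactual textual input, e.g. a rephrased or neutral prompt."""
    return f"{text}::textual_cf_{round_idx}"

def multi_round_counterfactual_decode(image, text, rounds=8, alpha=1.0):
    """Average contrastive logits over several counterfactual rounds."""
    factual = lvlm_logits(image, text)
    contrast = np.zeros_like(factual)
    for r in range(rounds):
        cf_visual = lvlm_logits(perturb_image(image, r), text)
        cf_textual = lvlm_logits(image, perturb_text(text, r))
        # Subtract counterfactual evidence to suppress language/vision bias.
        contrast += (factual - cf_visual) + (factual - cf_textual)
    return factual + alpha * contrast / (2 * rounds)

print(multi_round_counterfactual_decode("img.png", "What color is the cat?").round(3))
```

Under this reading, increasing `rounds` corresponds to the test-time scaling behavior noted in the abstract: more counterfactual rounds yield a more stable contrastive estimate.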
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9404