GLEN: Generalized Focal Loss Ensemble of Low-Rank Networks for Calibrated Visual Question Answering

Mahsa Mozaffari, Hitesh Sapkota, Qi Yu

Published: 04 Mar 2025, Last Modified: 04 Mar 2025 | AAAI 2025 | Everyone | CC BY 4.0

Abstract: Deep learning models with large-scale backbones are increasingly adopted to tackle complex visual question answering (VQA) problems in real-world settings. While these models provide the learning capacity needed to handle high-dimensional, multimodal VQA data, they tend to suffer from the memorization effect, which leads to overconfident predictions. This can significantly limit their applicability in critical domains (e.g., medicine, cyber-security, and public safety), where confidently wrong predictions may lead to severe consequences. In this work, we propose a novel low-rank network factorization that results in much better-calibrated networks. These low-rank factorized networks are then aggregated into an ensemble guided by a generalized focal loss to further improve overall performance and calibration. The overall framework, referred to as the Generalized focal Loss Ensemble of low-rank Networks (GLEN), is an important step toward developing well-calibrated VQA models. We demonstrate the effectiveness of our low-rank factorization-based ensemble technique for calibrating VQA models, and we show theoretically that the generalized focal loss provides a more balanced bias-variance trade-off, which is guaranteed to lower the confidence of incorrect predictions. Extensive experiments on benchmark datasets, with comparisons across various VQA models, show that GLEN achieves much better calibration on both in-distribution and out-of-distribution data without sacrificing VQA accuracy.
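
The abstract names three ingredients (low-rank factorization, an ensemble, and a focal-style loss) without giving their exact form. The sketch below is only an illustration of those generic building blocks, not the GLEN method itself: it assumes truncated SVD as the factorization, simple probability averaging as the ensemble rule, and the standard focal loss with focusing parameter `gamma` in place of the paper's generalized variant. All function names are hypothetical.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as A (m x rank) @ B (rank x n) via truncated SVD.
    Assumption: the paper's factorization may differ from plain SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # scale the kept left singular vectors
    B = Vt[:rank, :]             # kept right singular vectors
    return A, B

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def focal_loss(probs, labels, gamma=2.0):
    """Standard focal loss: mean of -(1 - p_true)^gamma * log(p_true).
    gamma = 0 recovers ordinary cross-entropy."""
    p_true = probs[np.arange(len(labels)), labels]
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true + 1e-12)))

def ensemble_probs(logits_list):
    """Average the softmax outputs of several members (assumed ensemble rule)."""
    return np.mean([softmax(l) for l in logits_list], axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(6, 4))
    A, B = low_rank_factorize(W, rank=2)
    print("factor shapes:", A.shape, B.shape)
    member_logits = [rng.normal(size=(5, 3)) for _ in range(3)]
    labels = np.array([0, 1, 2, 1, 0])
    probs = ensemble_probs(member_logits)
    print("focal loss:", focal_loss(probs, labels, gamma=2.0))
```

In this sketch, a larger `gamma` down-weights examples the model already predicts confidently, which is the general mechanism by which focal-style losses discourage overconfidence.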