Auditing the Performance and Calibration of Multi-Modal Large Language Models

Published: 13 Apr 2026, Last Modified: 13 Apr 2026 · Calibration for Modern AI @ AISTATS 2026 · CC BY 4.0
Keywords: Large Language Models, Uncertainty Quantification, Calibration, Multi-Modal
TL;DR: We analyze how estimates of calibration shift when applying multi-modal LLMs to generative versus classification tasks
Abstract: The impressive accuracy scores of Multi-modal Large Language Models (MLLMs) on visual multiple-choice question answering (MCQA) tasks only begin to measure their readiness for sensitive domains such as medicine, scientific research, and multi-modal analytics. Here, we conduct an analysis using uncertainty quantification (UQ) methods for text generation to probe the robustness and calibration underlying the strong performance of MLLMs on image QA benchmarks. Among several findings, we show that model calibration shifts drastically when comparing UQ metrics in the classification setting versus the open-ended, generative setting increasingly employed by MLLMs. Based on our analysis, we suggest that the path to robust, deployable MLLMs requires not only achieving high accuracy on benchmarks, but also improving performance and calibration on challenging, open-ended tasks across the multi-modal spectrum.
Submission Number: 24