Auditing the Performance and Calibration of Multi-Modal Large Language Models

Published: 13 Apr 2026, Last Modified: 13 Apr 2026 · Calibration for Modern AI @ AISTATS 2026 · CC BY 4.0
Keywords: Large Language Models, Uncertainty Quantification, Calibration, Multi-Modal
TL;DR: We analyze how estimates of calibration shift when applying multi-modal LLMs to generative versus classification tasks
Abstract: The impressive accuracy scores of Multi-modal Large Language Models (MLLMs) on visual multiple-choice question answering (MCQA) tasks only begin to measure their readiness for sensitive domains such as medicine, scientific research, and multi-modal analytics. Here, we conduct an analysis using uncertainty quantification (UQ) methods for text generation to probe the robustness and calibration underlying the strong performance of MLLMs on image QA benchmarks. Among several findings, we show that model calibration shifts drastically when comparing UQ metrics in the classification setting versus the open-ended, generative setting increasingly employed by MLLMs. Based on our analysis, we suggest that the path to robust, deployable MLLMs requires not only achieving high accuracy on benchmarks, but also improving performance and calibration on challenging, open-ended tasks across the multi-modal spectrum.
Submission Number: 24