Keywords: multimodal learning, llm evaluation, explainable ai, affective computing, emotion recognition, responsible ai, benchmark, diagnostic evaluation
TL;DR: We propose a two-stage protocol that diagnoses modality conflicts in multimodal LLMs to enable a fairer, reasoning-focused evaluation beyond simple accuracy.
Abstract: Current benchmarks for Multimodal Large Language Models (MLLMs) rely on a single accuracy score, a metric that is fundamentally flawed for subjective tasks like emotion recognition. This paradigm creates an "Intelligence-Accuracy Paradox": models with sophisticated reasoning about ambiguous human communication are penalized for not conforming to a single, oversimplified ground-truth label, while less intelligent models that exploit dataset biases can achieve higher scores. This paper argues that high accuracy often masks a "hidden failure" on complex, ambiguous instances. To address this, we propose a new, two-stage protocol that is both diagnostic and evaluative. Stage 1 acts as a diagnostic: it uses a phenomenon we term Dominant Modality Override (DMO), in which one modality's high-confidence signal hijacks the final decision, to automatically partition a dataset into unambiguous and conflict-rich samples. This diagnosis enables Stage 2, a fairer evaluation in which the two partitions are assessed differently: unambiguous samples are scored on accuracy, while conflict-rich samples are evaluated on the quality of the model's reasoning using metrics such as clue-based faithfulness and set-based plausibility. This protocol provides a fairer, more faithful "report card" of a model's true capabilities, rewarding intelligent reasoning over brittle pattern matching.
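The sketch below illustrates one way the two-stage protocol could be operationalized; it is not the paper's implementation. The dominance-gap heuristic for detecting DMO, the 0.5 threshold, and all names (`Sample`, `is_conflict_rich`, `clue_faithfulness`, `set_plausibility`, and the per-sample fields) are assumptions introduced purely for illustration.

```python
"""Minimal sketch of the two-stage protocol, under assumed data fields and metrics."""

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    sample_id: str
    gold_label: str                          # single ground-truth emotion label
    predicted_label: str
    modality_confidences: Dict[str, float]   # e.g. {"text": 0.9, "audio": 0.3}
    cited_clues: List[str]                   # evidence cited in the model's rationale
    gold_clues: List[str]                    # annotator-provided evidence clues
    predicted_set: List[str]                 # emotions the model deems plausible
    plausible_set: List[str]                 # annotator-accepted plausible emotions


def is_conflict_rich(s: Sample, dominance_gap: float = 0.5) -> bool:
    """Stage 1 (diagnostic): flag samples where no single modality dominates.

    A small gap between the strongest and weakest modality confidences is used
    here as a proxy for the absence of Dominant Modality Override (DMO); the
    threshold is an assumed placeholder, not a value from the paper.
    """
    confs = sorted(s.modality_confidences.values(), reverse=True)
    return (confs[0] - confs[-1]) < dominance_gap


def evaluate(samples: List[Sample]) -> Dict[str, float]:
    """Stage 2 (evaluative): score the two partitions with different metrics."""
    unambiguous = [s for s in samples if not is_conflict_rich(s)]
    conflict_rich = [s for s in samples if is_conflict_rich(s)]

    # Unambiguous partition: plain accuracy against the single gold label.
    accuracy = (
        sum(s.predicted_label == s.gold_label for s in unambiguous) / len(unambiguous)
        if unambiguous else 0.0
    )

    # Conflict-rich partition: reasoning quality instead of label accuracy.
    def clue_faithfulness(s: Sample) -> float:
        # Fraction of cited clues that annotators also marked as evidence.
        if not s.cited_clues:
            return 0.0
        return len(set(s.cited_clues) & set(s.gold_clues)) / len(s.cited_clues)

    def set_plausibility(s: Sample) -> float:
        # Jaccard overlap between predicted and annotator-accepted emotion sets.
        union = set(s.predicted_set) | set(s.plausible_set)
        return len(set(s.predicted_set) & set(s.plausible_set)) / len(union) if union else 0.0

    faithfulness = (sum(map(clue_faithfulness, conflict_rich)) / len(conflict_rich)
                    if conflict_rich else 0.0)
    plausibility = (sum(map(set_plausibility, conflict_rich)) / len(conflict_rich)
                    if conflict_rich else 0.0)

    return {
        "unambiguous_accuracy": accuracy,
        "conflict_faithfulness": faithfulness,
        "conflict_plausibility": plausibility,
    }
```

The key design point the sketch tries to capture is the separation of concerns: the Stage 1 partition decides which metric applies to each sample, so a model is never penalized on accuracy for samples the diagnostic itself marks as conflict-rich.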
Submission Number: 242