Keywords: Explainability, MLLMs, LLMs
TL;DR: We introduce NEMO, a benchmark for natural-language explanations of vision model errors, and SciTX, an MLLM-based method that selects explanations whose counterfactual interventions shift the model's prediction toward the ground-truth class.
Abstract: With the rise of agentic LLM systems, non-experts increasingly interact with vision classifiers through natural language. When a classifier misclassifies an image, users need a faithful account of why. Such explanations help users diagnose failure modes and debug the model. Two gaps block progress on this need. First, no benchmark evaluates free-form natural-language explanations of vision model errors. Second, existing retrieval-based methods are limited to a fixed corpus of error sentences and cannot describe failure modes outside it. We address both gaps. We introduce NEMO, a task and benchmark paired with an LLM-as-a-Judge protocol that scores whether an explanation describes the failure factor. We then propose SciTX, a generation-based method powered by Multimodal Large Language Models (MLLMs). SciTX is a four-stage pipeline: observation, hypothesis, experiment, and conclusion. It retrieves contrastive observations, generates candidate hypotheses, validates each hypothesis via a counterfactual intervention, and selects the hypothesis whose intervention shifts the model's prediction most toward the ground-truth class. SciTX outperforms retrieval-based and MLLM-augmented baselines. A human study with AI practitioners also ranks SciTX first.
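The final selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `select_hypothesis`, the callables `intervene` and `predict_proba`, and the toy attribute-based classifier are all hypothetical stand-ins, and the shift is assumed here to be the change in the ground-truth class probability after the intervention.

```python
from typing import Callable, Dict, Sequence, Tuple

def select_hypothesis(
    image: dict,
    true_class: str,
    hypotheses: Sequence[str],
    intervene: Callable[[dict, str], dict],            # hypothetical: applies a counterfactual edit
    predict_proba: Callable[[dict], Dict[str, float]],  # hypothetical: class -> probability
) -> Tuple[str, float]:
    """Return the hypothesis whose intervention raises the
    ground-truth class probability the most, with that shift."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    base = predict_proba(image).get(true_class, 0.0)
    best, best_shift = hypotheses[0], float("-inf")
    for h in hypotheses:
        edited = intervene(image, h)
        shift = predict_proba(edited).get(true_class, 0.0) - base
        if shift > best_shift:
            best, best_shift = h, shift
    return best, best_shift

# Toy stand-ins (assumptions for illustration only): an "image" is a dict of
# attributes, and the classifier is biased by a snowy background.
def toy_predict_proba(img: dict) -> Dict[str, float]:
    p_husky = 0.2
    if img["background"] != "snow":
        p_husky += 0.6
    if img["pose"] == "frontal":
        p_husky += 0.1
    return {"husky": p_husky, "wolf": 1.0 - p_husky}

# Each hypothesis maps to the attribute edit that would test it.
EDITS = {
    "snowy background": ("background", "grass"),
    "side pose": ("pose", "frontal"),
}

def toy_intervene(img: dict, hypothesis: str) -> dict:
    key, value = EDITS[hypothesis]
    edited = dict(img)  # copy so the original image is untouched
    edited[key] = value
    return edited

image = {"background": "snow", "pose": "side"}
best, shift = select_hypothesis(
    image, "husky", list(EDITS), toy_intervene, toy_predict_proba
)
print(f"selected: {best} (shift {shift:.2f})")
```

In this toy setup the background edit moves the prediction toward the true class more than the pose edit does, so it is the hypothesis selected. A real system would replace the attribute edits with image-level interventions and the toy classifier with the vision model under study.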
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 161