Towards Scalable Explainable AI: Using Vision-Language Models to Interpret Vision Systems

TMLR Paper7571 Authors

18 Feb 2026 (modified: 21 Apr 2026), Rejected by TMLR, CC BY 4.0
Abstract: Explainable AI (xAI) is increasingly important for the trustworthy deployment of vision models in domains such as medical imaging, autonomous driving, and safety-critical systems. However, although saliency maps provide useful information about vision models, current xAI methods remain bottlenecked: they either rely on manual inspection of saliency maps or explain them sample by sample, without aggregating the results to characterize model behavior on large datasets. This makes large-scale analysis time-consuming and subjective. To address this, we propose a scalable automated pipeline that leverages Vision-Language Models (VLMs) to evaluate saliency-based explanations at both the sample and dataset levels. Our method takes masked CAM images, prompts VLMs to generate a description for each sample and score attention quality, and aggregates the results into a confusion-matrix framework for systematic analysis. We validate the pipeline on the COCO, ImageNet, and PASTA datasets to demonstrate its capability and reliability. The results show that our pipeline achieves a Pearson correlation of 0.78 with human judgments, outperforming traditional metrics and other xAI methods in performance, usefulness, and human alignment. We also show that the framework enables practical applications such as detecting mislabeled or incorrect samples, with an F1-score of 0.893 on COCO and 0.885 on ImageNet, demonstrating its utility for scalable model evaluation and data auditing.
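To make the steps named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that masking means zeroing pixels whose CAM activation falls below a threshold, that the VLM returns a numeric attention-quality score per sample, and that the "confusion-matrix framework" cross-tabulates prediction correctness against attention quality. The names `mask_by_cam`, `aggregate`, `pearson`, the record fields, and the cutoff values are all hypothetical; the VLM call itself is omitted.

```python
import math
from collections import Counter


def mask_by_cam(image, cam, threshold=0.5):
    """Keep pixels whose CAM activation is >= threshold; zero out the rest.

    `image` and `cam` are same-shaped nested lists (rows of values).
    The threshold is an illustrative assumption.
    """
    return [
        [pixel if weight >= threshold else 0
         for pixel, weight in zip(img_row, cam_row)]
        for img_row, cam_row in zip(image, cam)
    ]


def aggregate(records, score_cutoff=3):
    """Cross-tabulate prediction correctness against VLM attention quality.

    Each record is a dict with hypothetical keys 'label', 'predicted',
    and 'vlm_score' (the score a VLM assigned to the masked image).
    """
    table = Counter()
    for rec in records:
        prediction = "correct" if rec["predicted"] == rec["label"] else "wrong"
        attention = "good" if rec["vlm_score"] >= score_cutoff else "poor"
        table[(prediction, attention)] += 1
    return table


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

Under these assumptions, cells such as `("correct", "poor")` would collect samples where the model predicts the label while attending to the wrong region, which is the kind of case a dataset-level audit would want to surface. `pearson` can then compare VLM scores against human ratings of the same samples.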
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Magda_Gregorova2
Submission Number: 7571