Towards Scalable Explainable AI: Using Vision-Language Models to Interpret Vision Systems

TMLR Paper7571 Authors

18 Feb 2026 (modified: 22 Feb 2026). Under review for TMLR. License: CC BY 4.0

Abstract: Explainable AI (xAI) is increasingly important for the trustworthy deployment of vision models in domains such as medical imaging, autonomous driving, and safety-critical systems. However, modern vision models are typically trained on massive datasets, making it nearly impossible for researchers to manually track how a model learns from each sample, especially when relying on saliency maps that require intensive visual inspection. Traditional xAI methods, while useful, often focus on instance-level explanations and risk losing important information about model behavior at scale, leaving analysis time-consuming, subjective, and difficult to reproduce. To overcome these challenges, we propose an automated evaluation pipeline that leverages Vision-Language Models (VLMs) to analyze vision models at both the sample and dataset levels. Our pipeline systematically generates, assesses, and interprets saliency-based explanations, aggregates them into structured summaries, and enables scalable discovery of failure cases, biases, and behavioral trends. By reducing reliance on manual inspection while preserving critical information, the proposed approach facilitates more efficient and reproducible xAI research, supporting the development of robust and transparent vision models.
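
The aggregation step mentioned in the abstract, where per-sample interpretations are rolled up into a dataset-level summary, might look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the `SampleReport` record, its field names, the `summarize` helper, and the "object"/"background" focus categories are all hypothetical, and the per-sample entries are hand-written stand-ins for what a VLM might report about a saliency map.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SampleReport:
    """Hypothetical per-sample record; field names are illustrative only."""
    sample_id: str
    label: str       # ground-truth class
    prediction: str  # class predicted by the vision model
    focus: str       # region the saliency map highlights, as described by a VLM


def summarize(reports):
    """Aggregate per-sample reports into a dataset-level summary."""
    errors = [r for r in reports if r.prediction != r.label]
    return {
        "n_samples": len(reports),
        "n_errors": len(errors),
        "focus_counts": dict(Counter(r.focus for r in reports)),
        "error_focus_counts": dict(Counter(r.focus for r in errors)),
    }


# Hand-written stand-ins for VLM outputs; no real model is called here.
reports = [
    SampleReport("img0", "dog", "dog", "object"),
    SampleReport("img1", "dog", "cat", "background"),
    SampleReport("img2", "cat", "cat", "object"),
    SampleReport("img3", "cat", "dog", "background"),
]
summary = summarize(reports)
```

In this toy data every misclassified sample has a saliency focus on the background, which is the kind of dataset-level trend that is hard to spot when saliency maps are inspected one image at a time.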

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Magda_Gregorova2

Submission Number: 7571
