Submission Type: Non-archive
Keywords: Multimodal Explainability, Vision-Language Models, Gradient-Based Attribution, Attention-Based Reasoning, Context-Aware Interpretability
TL;DR: We propose MMEL, a gradient-based framework that enhances vision–language model interpretability by integrating multi-scale attention reasoning and semantic relationship modeling for more faithful, context-aware explanations.
Abstract: Recent advances in vision-language models have significantly expanded the frontiers of automated image analysis. However, applying these models in safety-critical contexts remains challenging due to the complex relationships between objects, subtle visual cues, and the heightened demand for transparency and reliability. This paper presents the \emph{Multi-Modal Explainable Learning} (MMEL) framework, designed to enhance the interpretability of vision-language models while maintaining high performance. Building upon prior work on gradient-based explanations for transformer architectures (Grad-ECLIP), MMEL introduces a novel \emph{Hierarchical Semantic Relationship Module} that improves model interpretability through multi-scale feature processing, adaptive attention weighting, and cross-modal alignment. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities, applying learnable layer-specific weights to balance contributions across the model's depth. The result is more comprehensive visual explanations that highlight both primary objects and their contextual relationships with improved precision. Through extensive experiments on standard datasets, we demonstrate that incorporating semantic relationship information into gradient-based attribution maps yields more focused and contextually aware visualizations that better reflect how vision-language models process complex scenes. The MMEL framework generalizes across domains, offering valuable insights into model decisions for applications requiring high interpretability and reliability.
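The layer-weighted gradient-attribution idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the softmax normalization of layer weights, and the Grad-CAM-style ReLU(gradient × attention) combination are all assumptions filling in details the abstract does not specify.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax for the layer-weight vector
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_weighted_attribution(attn_maps, grads, layer_weights):
    """Hypothetical sketch of layer-weighted gradient-attention attribution.

    attn_maps, grads : arrays of shape (L, H, W), one attention map and its
                       gradient per transformer layer (synthetic here).
    layer_weights    : shape (L,), learnable per-layer coefficients
                       (fixed values here for illustration).
    """
    w = softmax(layer_weights)                  # normalize layer contributions
    contrib = np.maximum(attn_maps * grads, 0)  # ReLU(grad x attention) per layer
    heat = np.tensordot(w, contrib, axes=1)     # weighted sum over layers -> (H, W)
    # rescale to [0, 1] for visualization as a heatmap
    heat -= heat.min()
    if heat.max() > 0:
        heat /= heat.max()
    return heat

# Synthetic example: 4 layers of 7x7 attention maps and gradients.
rng = np.random.default_rng(0)
attn = rng.random((4, 7, 7))
grad = rng.standard_normal((4, 7, 7))
heatmap = layer_weighted_attribution(attn, grad, np.array([0.1, 0.3, 0.5, 1.0]))
print(heatmap.shape)  # (7, 7)
```

In this sketch, later layers receive larger weights, reflecting the abstract's point that contributions must be balanced across the model's depth rather than read off a single layer.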
Submission Number: 20