Keywords: multimodal, interpretability, monosemanticity
Abstract: Humans experience the world through multiple modalities, such as vision, language, and speech, making it natural to explore the commonalities and distinctions among them. In this work, we take a data-driven approach to this question by analyzing \textbf{interpretable, monosemantic features} extracted from deep multimodal models. Specifically, we introduce the Modality Dominance Score (MDS) to attribute each multimodal feature to a specific modality. We then map the features into a more interpretable space, enabling us to categorize them into three distinct classes: vision features (single-modal), language features (single-modal), and visual-language features (cross-modal). Interestingly, this data-driven categorization closely aligns with human intuitive understanding of the different modalities. We further show that this modality decomposition benefits multiple downstream tasks, including reducing bias in gender detection, generating cross-modal adversarial examples, and enabling modality-specific feature control in text-to-image generation. These results indicate that large-scale multimodal models, when equipped with task-agnostic interpretability tools, can offer valuable insights into the relationships between different data modalities.
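The abstract does not specify how the Modality Dominance Score is computed, so the following is only a minimal illustrative sketch. It assumes access to activations of monosemantic features on vision-only and language-only inputs, and defines a hypothetical MDS as a normalized difference of mean activations, thresholded into the three feature classes described above; the function names, the formula, and the threshold `tau` are all assumptions, not the authors' method.

```python
import numpy as np

def modality_dominance_score(vision_acts: np.ndarray,
                             language_acts: np.ndarray) -> np.ndarray:
    """Hypothetical MDS: how much more each feature fires on vision vs. language inputs.

    vision_acts, language_acts: (num_samples, num_features) non-negative activations
    of the same monosemantic features on vision-only and language-only inputs.
    Returns a (num_features,) array in [-1, 1]: positive leans vision, negative
    leans language, near zero suggests cross-modal. The exact form is an assumption.
    """
    v = vision_acts.mean(axis=0)
    l = language_acts.mean(axis=0)
    return (v - l) / (v + l + 1e-8)  # normalized mean-activation difference

def categorize_features(mds: np.ndarray, tau: float = 0.5):
    """Split feature indices into vision, language, and cross-modal sets by thresholding MDS."""
    vision = np.where(mds > tau)[0]
    language = np.where(mds < -tau)[0]
    cross_modal = np.where(np.abs(mds) <= tau)[0]
    return vision, language, cross_modal

# Usage sketch with random placeholder activations:
rng = np.random.default_rng(0)
v_acts = rng.random((128, 1024))
l_acts = rng.random((128, 1024))
vision_ids, language_ids, cross_ids = categorize_features(
    modality_dominance_score(v_acts, l_acts))
```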
Primary Area: interpretability and explainable AI
Submission Number: 20907