TL;DR: The proposed unsupervised method identifies pairs of latent-space directions (filter and signal): the first answers questions about interpretability, the second about a concept's influence on the model's predictions.
Abstract: Latent space directions have played a key role in understanding, debugging, and improving deep learning models, since concepts are encoded as superpositions along directions of the feature space. The encoding direction of a concept maps a latent factor to a feature component, while the decoding direction retrieves it. These encoding-decoding direction pairs unlock significant potential for opening up the black box of deep networks. Decoding directions help attribute meaning to latent codes, encoding directions help assess the influence of a concept on the predictions, and both directions may assist in unlearning irrelevant concepts. Compared to previous autoencoder and dictionary learning approaches, we offer a different perspective on learning these direction pairs. We base the decoding direction on unsupervised interpretable basis learning and introduce signal vectors to estimate encoding directions. We further show empirically that the uncertainty region of the model is informative and can be used to effectively reveal meaningful and influential concepts that impact model predictions. Tests on synthetic data show the approach's efficacy in recovering the underlying encoding-decoding direction pairs in a controlled setting, while experiments on state-of-the-art deep image classifiers show notable improvements, or competitive performance, in interpretability and influence compared to previous unsupervised and even supervised direction learning approaches.
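The distinction between encoding (signal) and decoding (filter) directions mentioned in the abstract can be illustrated with a minimal sketch. The snippet below assumes the usual linear signal/filter relationship (a signal direction proportional to the feature covariance applied to the filter); it is an illustrative toy example with made-up variable names, not necessarily the paper's estimator.

```python
import numpy as np

# Illustrative assumption: for a linear readout z = w^T x, the "filter"
# (decoding) direction is w, and a corresponding "signal" (encoding)
# direction can be recovered as a ∝ Cov(x) w.

rng = np.random.default_rng(0)

# Synthetic features: a latent concept s encoded along a_true, plus a distractor.
a_true = np.array([1.0, 0.5])          # true encoding (signal) direction
d = np.array([-0.5, 1.0])              # distractor direction
s = rng.normal(size=1000)              # latent concept values
n = rng.normal(size=1000)              # distractor values
X = np.outer(s, a_true) + np.outer(n, d)

# A filter w that reads out s must cancel the distractor,
# so w need not align with a_true.
w = np.linalg.lstsq(X, s, rcond=None)[0]

# Recover the signal direction from the feature covariance.
a_hat = np.cov(X, rowvar=False) @ w
a_hat /= np.linalg.norm(a_hat)

print("filter  w    :", w / np.linalg.norm(w))
print("signal  a_hat:", a_hat)
print("true signal  :", a_true / np.linalg.norm(a_true))
```

Running this shows the filter and signal directions differ whenever a distractor is present, which is why interpretability (reading latent codes) and influence (how a concept drives predictions) call for two different directions.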
Primary Area: Deep Learning
Keywords: latent space, interpretability, concepts, directions, signals, patterns, distractors
Submission Number: 10034