Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

TMLR Paper 6027 Authors

28 Sept 2025 (modified: 19 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Empirical evidence shows that deep vision networks represent concepts as directions in latent space, vectors which we call concept embeddings. For each concept, a scalar latent factor indicates the degree to which the concept is present in an input patch. For a given patch, the latent factors of multiple concepts are encoded into a compact vector representation by linearly combining concept embeddings, with the latent factors serving as coefficients. Since these embeddings enable such encoding, we refer to them as encoding directions. A latent factor can be recovered from the representation by taking the inner product with a filter, a vector which we call a decoding direction. These encoding-decoding direction pairs are not directly accessible, but recovering them unlocks significant potential to open up the black box of deep networks, enabling us to understand, debug, and improve deep learning models. Decoding directions help attribute meaning to latent codes, while encoding directions help assess a concept's influence on the predictions, and both directions may assist model correction by unlearning concepts irrelevant to the network's prediction task. Compared to previous matrix decomposition, autoencoder, and dictionary learning approaches, which rely on reconstructing feature activations, we propose a different perspective for learning these direction pairs. We identify the decoding directions through directional clustering of feature activations and introduce signal vectors to estimate the encoding directions from a probabilistic perspective. Unlike most other works, we also take advantage of the knowledge encoded in the weights of the network to guide our direction search. For this, we show that a novel technique called \textit{Uncertainty Region Alignment} can exploit this knowledge to effectively reveal interpretable directions that influence the network's predictions. We perform a thorough and multifaceted comparative analysis to offer insights into the fidelity of the direction pairs, the advantages of our method over other unsupervised direction-learning approaches, and how the learned directions compare with those learned under supervision. We find that: a) in controlled settings with synthetic data, our approach is effective in recovering the ground-truth encoding-decoding direction pairs; b) in real-world settings, the decoding directions correspond to monosemantic, interpretable concepts, often scoring substantially better on interpretability metrics than other unsupervised baselines; c) in the same settings, signal vectors are faithful estimators of the concept encoding directions, as validated with a novel approach based on activation maximization. At the application level, we provide examples that demonstrate how the learned directions can help to a) understand global model behavior; b) explain individual sample predictions in terms of local, spatially aware concept contributions; and c) intervene on the network's prediction strategy, either to provide counterfactual explanations or to correct erroneous model behavior.
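To make the encoding-decoding setup above concrete, the following is a minimal worked formulation in our own illustrative notation (the symbols $\mathbf{z}$, $s_k$, $\mathbf{d}_k$, $\mathbf{f}_k$ are not taken from the paper): a patch representation $\mathbf{z}$ encodes the latent factors $s_1,\dots,s_K$ of $K$ concepts as a linear combination of encoding directions $\mathbf{d}_k$, and a decoding direction $\mathbf{f}_k$ recovers each factor via an inner product,
\[
\mathbf{z} \;=\; \sum_{k=1}^{K} s_k\,\mathbf{d}_k, \qquad \hat{s}_k \;=\; \langle \mathbf{f}_k, \mathbf{z} \rangle,
\]
with exact recovery $\hat{s}_k = s_k$ whenever $\langle \mathbf{f}_k, \mathbf{d}_j \rangle = \delta_{kj}$, i.e., when the decoding directions are biorthogonal to the encoding directions.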
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Francisco_J._R._Ruiz1
Submission Number: 6027