TL;DR: We show that current concept-based architectures fail to properly benefit from concept interventions when inputs are OOD, and we propose a novel architecture to address this.
Abstract: In this paper, we investigate how concept-based models (CMs) respond to out-of-distribution (OOD) inputs. CMs are interpretable neural architectures that first predict a set of high-level *concepts* (e.g., "stripes", "black") and then predict a task label from those concepts. In particular, we study the impact of *concept interventions* (i.e., operations where a human expert corrects a CM’s mispredicted concepts at test time) on CMs' task predictions when inputs are OOD. Our analysis reveals a weakness in current state-of-the-art CMs, which we term *leakage poisoning*, that prevents them from properly improving their accuracy when intervened on for OOD inputs. To address this, we introduce *MixCEM*, a new CM that learns to dynamically exploit leaked information missing from its concepts only when this information is in-distribution. Our results across tasks with and without complete sets of concept annotations demonstrate that MixCEMs outperform strong baselines by significantly improving their accuracy for both in-distribution and OOD samples in the presence and absence of concept interventions.
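Below is a minimal sketch (not the paper's implementation; all module names and dimensions are illustrative assumptions) of the setup the abstract describes: a concept-based model that first predicts concepts and then a task label, together with a test-time concept intervention that overwrites selected concept predictions with expert-provided values.

```python
# Illustrative sketch of a concept bottleneck model and a concept
# intervention; names and sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    def __init__(self, in_dim=128, n_concepts=10, n_classes=5):
        super().__init__()
        # Input -> concept logits (e.g., "stripes", "black").
        self.concept_net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, n_concepts)
        )
        # Predicted concepts -> task label.
        self.label_net = nn.Linear(n_concepts, n_classes)

    def forward(self, x, intervention_mask=None, true_concepts=None):
        c_hat = torch.sigmoid(self.concept_net(x))
        if intervention_mask is not None:
            # Intervention: an expert replaces the selected (mispredicted)
            # concepts with their ground-truth values before label prediction.
            c_hat = torch.where(intervention_mask.bool(), true_concepts, c_hat)
        return self.label_net(c_hat), c_hat

# Usage: intervene on the first concept of a single sample.
model = ConceptBottleneckModel()
x = torch.randn(1, 128)
mask = torch.zeros(1, 10); mask[0, 0] = 1.0   # which concepts the expert corrects
truth = torch.ones(1, 10)                     # expert-provided concept values
y_logits, concepts = model(x, intervention_mask=mask, true_concepts=truth)
```

The paper's finding is that, for state-of-the-art CMs, this kind of correction stops improving task accuracy once the input is OOD, a failure mode it terms leakage poisoning.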
Lay Summary: Recent advances in Artificial Intelligence (AI) have led to powerful models that can receive help in the form of "concept interventions". A concept intervention is an operation where, during deployment, an expert communicates the presence or absence of a high-level concept in the model's input through a manipulation of its inner representations. This way, for example, radiologists can let an AI assistant know that an X-ray scan has "bone spurs", helping the assistant make a more accurate diagnosis.
The real world, however, is messy: the inputs we provide to models may contain noise or reflect conditions that differ from those the model was exposed to during training. In this paper, we demonstrate that, in these instances, concept interventions fail to properly aid the model in its downstream task. We argue that this is due to "leakage poisoning", where a model's representations become too corrupted for interventions to work.
We address this by proposing a way of representing concepts that enables the model to restrict this poisonous leakage whenever the input strays too far from what the model was exposed to during training. Our results show that our representations lead to highly accurate models that remain intervenable when provided with both expected and unexpected inputs.
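A rough analogue of this idea (an illustrative gating sketch, not MixCEM's actual architecture) is to split each representation into a concept-only part and a residual carrying extra, leaked information, and to down-weight the residual by an in-distribution score so that leakage is suppressed for OOD inputs.

```python
# Illustrative gating sketch; the scorer and embedding layers stand in for
# whatever ID/OOD signal and concept representation a model actually uses.
import torch
import torch.nn as nn

class GatedConceptEmbedding(nn.Module):
    def __init__(self, in_dim=128, emb_dim=16):
        super().__init__()
        self.concept_emb = nn.Linear(in_dim, emb_dim)   # concept-only information
        self.residual_emb = nn.Linear(in_dim, emb_dim)  # extra (leaked) information
        self.id_scorer = nn.Linear(in_dim, 1)           # placeholder in-distribution score

    def forward(self, x):
        in_dist_score = torch.sigmoid(self.id_scorer(x))  # ~1 when in-distribution
        # Only let the leakage-carrying residual through for in-distribution inputs.
        return self.concept_emb(x) + in_dist_score * self.residual_emb(x)
```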
Link To Code: https://github.com/mateoespinosa/cem
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: XAI, Concept-based Models, Concept Bottleneck Models, Concept Interventions, Out-of-distribution, OOD
Submission Number: 4642