Abstract: Interventions targeting the representation space of language models (LMs) have emerged as an effective means of influencing model behavior. Such methods are employed, for example, to eliminate or alter the encoding of demographic information such as gender within the model's representations, creating a \emph{counterfactual representation}. However, because the intervention operates within the representation space, it is difficult to understand precisely which features it modifies. We propose a technique for converting representation-space counterfactuals into natural language counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation-space intervention and to interpret the features used to encode a specific concept. Moreover, we show that the resulting counterfactuals can effectively mitigate bias in classification.
Paper Type: short
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English