What Changed? Converting Representational Interventions to Natural Language

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone

Abstract: Interventions targeting the representation space of language models (LMs) have emerged as effective means to influence model behavior. These methods are employed, for example, to eliminate or alter the encoding of demographic information such as gender within the model's representations, creating a "counterfactual representation". However, since the intervention operates within the representation space, understanding precisely which features it modifies poses a challenge. We propose a technique for converting representation-space counterfactuals into natural language counterfactuals. We demonstrate that this approach enables us to analyze the linguistic alterations corresponding to a given representation-space intervention and to interpret the features utilized for encoding a specific concept. Moreover, we demonstrate that the resulting counterfactuals can effectively mitigate bias in classification.

Paper Type: short

Research Area: Interpretability and Analysis of Models for NLP

Contribution Types: Model analysis & interpretability

Languages Studied: English
