Probing Classifiers are Unreliable for Concept Removal and Detection

Abhinav Kumar; Chenhao Tan; Amit Sharma

Published: 21 Jul 2022, Last Modified: 04 May 2025, SCIS 2022 Poster, Readers: Everyone

Keywords: Fairness, Probing, Null-Space Removal, Adversarial Removal, Spurious Correlation

TL;DR: We theoretically and experimentally demonstrate that probing-based null-space and adversarial removal methods fail to remove sensitive attributes from latent representations.

Abstract: Neural network models trained on text data have been found to encode undesired linguistic or sensitive attributes in their representations. Removing such attributes is non-trivial because of the complex relationship between the attribute, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted attributes from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the attributes entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the attribute, which we prove is difficult to train correctly in the presence of spurious correlations.
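The failure mode described in the abstract can be illustrated with a toy sketch. This is not the authors' experiment or code: the 2-D representation, the 90% agreement between task label and attribute, the noise levels, and all function names are assumptions chosen for illustration, and a least-squares linear probe stands in for the probing classifier. The sketch fits the probe for the sensitive attribute, then measures how much of the task axis a single null-space projection along the probe direction would erase and how correlated the remaining coordinate still is with the attribute.

```python
import math
import random


def make_data(n=5000, agree=0.9, task_noise=0.3, concept_noise=1.0, seed=0):
    """Toy 2-D representation z = (task feature, concept feature). All values are assumed."""
    rng = random.Random(seed)
    Z, A = [], []
    for _ in range(n):
        a = rng.choice((-1.0, 1.0))            # sensitive attribute
        y = a if rng.random() < agree else -a  # task label, spuriously correlated with a
        Z.append((y + rng.gauss(0.0, task_noise), a + rng.gauss(0.0, concept_noise)))
        A.append(a)
    return Z, A


def fit_linear_probe(Z, A):
    """Least-squares linear probe (no bias) predicting the attribute from z."""
    s11 = sum(z1 * z1 for z1, _ in Z)
    s12 = sum(z1 * z2 for z1, z2 in Z)
    s22 = sum(z2 * z2 for _, z2 in Z)
    b1 = sum(z1 * a for (z1, _), a in zip(Z, A))
    b2 = sum(z2 * a for (_, z2), a in zip(Z, A))
    det = s11 * s22 - s12 * s12
    # Closed-form solution of the 2x2 normal equations.
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)


def task_axis_removed(w):
    """Fraction of the task axis (1, 0) erased by projecting onto the null space of w."""
    return w[0] ** 2 / (w[0] ** 2 + w[1] ** 2)


def residual_concept_correlation(Z, A, w):
    """Pearson correlation between the attribute and what is left after one projection."""
    norm = math.hypot(w[0], w[1])
    u1, u2 = w[0] / norm, w[1] / norm
    # In 2-D the null space of w is the line spanned by (-u2, u1).
    c = [-u2 * z1 + u1 * z2 for z1, z2 in Z]
    mc, ma = sum(c) / len(c), sum(A) / len(A)
    cov = sum((ci - mc) * (ai - ma) for ci, ai in zip(c, A))
    var_c = sum((ci - mc) ** 2 for ci in c)
    var_a = sum((ai - ma) ** 2 for ai in A)
    return cov / math.sqrt(var_c * var_a)


if __name__ == "__main__":
    Z, A = make_data()
    w = fit_linear_probe(Z, A)
    print("probe weights (task, concept):", w)
    print("fraction of task axis removed:", task_axis_removed(w))
    print("residual correlation with attribute:", residual_concept_correlation(Z, A, w))
```

Under these assumed settings the concept feature is noisier than the task feature, so the probe leans on the spuriously correlated task feature. Projecting out the probe direction therefore erases a large share of the task axis, while the coordinate that remains is still correlated with the attribute.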

Confirmation: Yes

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/probing-classifiers-are-unreliable-for/code)
