Keywords: Fairness, Probing, Null-Space Removal, Adversarial Removal, Spurious Correlation
TL;DR: We theoretically and experimentally demonstrate that probing-based null-space and adversarial removal methods fail to remove the sensitive attribute from the latent representation.
Abstract: Neural network models trained on text data have been found to encode undesired linguistic or sensitive attributes in their representations. Removing such attributes is non-trivial because of the complex relationship between the attribute, the text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted attributes from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the attributes entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the attribute, which we prove is difficult to train correctly in the presence of spurious correlations.