Keywords: Feature Learning, Bias Mitigation, AI Fairness, Language Models
TL;DR: Using model gradients, we learn a single feature neuron that encodes a desired feature (e.g., gender) along an orthogonal axis, and we show that this can be used to debias models.
Abstract: AI systems frequently exhibit and amplify social biases, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a feature neuron encoding societal bias information such as gender, race, and religion. We show that our method not only identifies which weights of a model must be changed to modify a feature, but also that these edits can be used to rewrite models to debias them while preserving their other capabilities. We demonstrate the effectiveness of our approach across various model architectures and highlight its potential for broader applications.
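To make the abstract's core idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation, whose details are not given here): an encoder-decoder with a one-neuron bottleneck, trained so that the single neuron predicts a societal attribute label while the decoder reconstructs the original activation. All names, shapes, and the loss combination below are illustrative assumptions.

```python
# Illustrative sketch only -- NOT the paper's actual method.
# Assumes we have hidden activations `h` (shape [batch, d]) taken from a
# language model, and binary labels `y` for a societal attribute
# (e.g., gender) associated with each example.
import torch
import torch.nn as nn

d = 768  # hypothetical hidden size


class FeatureBottleneck(nn.Module):
    """Encoder-decoder with a single 'feature neuron' as the bottleneck."""

    def __init__(self, d: int):
        super().__init__()
        self.encoder = nn.Linear(d, 1)  # projects activations onto one axis
        self.decoder = nn.Linear(1, d)  # reconstructs activations from it

    def forward(self, h: torch.Tensor):
        z = self.encoder(h)             # scalar feature neuron per example
        return self.decoder(z), z


model = FeatureBottleneck(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
recon_loss = nn.MSELoss()
feat_loss = nn.BCEWithLogitsLoss()


def train_step(h: torch.Tensor, y: torch.Tensor) -> float:
    """One gradient step: the bottleneck neuron is pushed to predict the
    attribute while the decoder reconstructs the activation, so the single
    neuron comes to isolate that feature direction."""
    h_hat, z = model(h)
    loss = recon_loss(h_hat, h) + feat_loss(z.squeeze(-1), y.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Under these assumptions, the learned encoder weight vector defines a single direction in activation space associated with the attribute; editing or ablating activations along that direction is one plausible way such a neuron could support debiasing.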
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 20715