Keywords: steering, fairness, representation, Bayes optimal classifiers, data bias, ideal distributions
TL;DR: Steering given distributions toward ideal distributions, on which fairness and accuracy are not in a trade-off.
Abstract: To fix the ``bias in, bias out'' problem in fair machine learning, it is important to steer feature distributions of data or internal representations of Large Language Models (LLMs) to \emph{ideal} ones that guarantee group-fair outcomes. Previous work on fair generative models and representation steering could greatly benefit from provable fairness guarantees on the model output. We define a distribution as \emph{ideal} if the minimizer of any cost-sensitive risk on it is guaranteed to have exact group-fair outcomes (e.g., demographic parity, equal opportunity)---in other words, it has no fairness-utility trade-off. We formulate an optimization program for optimal steering by finding the nearest \emph{ideal} distribution in KL-divergence, and provide efficient algorithms for it when the underlying distributions come from well-known parametric families (e.g., normal, log-normal).
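For intuition, the abstract's steering program can be read as a KL projection onto the set of ideal distributions; the sketch below is an assumption about its form (the exact constraint set $\mathcal{I}$ and the direction of the KL-divergence are not specified on this page):

```latex
% Hypothetical sketch: project the given distribution P onto the set
% \mathcal{I} of ideal distributions (those with no fairness-utility
% trade-off) in KL-divergence. The KL direction shown here is an
% assumption, not confirmed by the abstract.
\[
  Q^\star \;=\; \operatorname*{arg\,min}_{Q \in \mathcal{I}} \; \mathrm{KL}\left(Q \,\middle\|\, P\right)
\]
% When P and Q are restricted to a parametric family (e.g., normal or
% log-normal), the abstract states this projection admits efficient
% algorithms.
```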
Empirically, our optimal steering techniques improve fairness without diminishing utility (and sometimes even improve utility) on both synthetic and real-world datasets. We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in the Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs toward desired outputs so that the steered model works equally well across different groups.
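As a minimal illustration of affine representation steering in the spirit of the abstract, the sketch below shifts each group's representations so their group-conditional means coincide; the function name and the mean-matching transform are illustrative assumptions, not the paper's algorithm, which solves an optimization for the provably ideal target:

```python
import numpy as np

def affine_steer(X, groups):
    """Hypothetical mean-matching affine steering sketch.

    Translates each group's representations so that all group-conditional
    means coincide with the pooled mean. This is only one simple instance
    of an affine transform x -> A x + b; the paper's optimal steering
    toward an ideal distribution may differ.
    """
    X = np.asarray(X, dtype=float)
    pooled_mean = X.mean(axis=0)
    X_steered = X.copy()
    for g in np.unique(groups):
        mask = (groups == g)
        # Shift group g so its mean matches the pooled mean.
        X_steered[mask] += pooled_mean - X[mask].mean(axis=0)
    return X_steered

# Toy usage: two groups of 2-D representations with different means.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(2.0, 1.0, (50, 2))])
groups = np.array([0] * 50 + [1] * 50)
X_fair = affine_steer(X, groups)
```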
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 21793