Probing Implicit Bias Risk Framing in Language Models

Published: 02 Mar 2026, Last Modified: 14 Apr 2026, AFAA 2026 Poster, CC BY 4.0
Track: Tiny/Short Papers Track (up to 3 pages)
Keywords: Bias, Fairness, Large Language Models, Linear Probe, Interpretability
TL;DR: Linear probes show LLM hidden states encode when demographics are implicitly framed as decision-relevant. Under cross-generator transfer, probes beat lexical baselines, indicating representation-level rather than surface encoding.
Abstract: Do large language models encode when demographic information is implicitly framed as decision-relevant? We study 903 synthetic, LLM-generated decision-support prompts spanning 15 high-stakes domains, labeled according to a controlled framing distinction: demographic mentions as incidental administrative context versus as subtly decision-relevant social context. We train linear probes on hidden states and evaluate under cross-generator transfer, requiring generalization across independently generated prompt distributions. Probes outperform both bag-of-words and frozen transformer baselines (0.93 AUROC vs. 0.82 for bag-of-words and 0.71–0.72 for frozen embeddings), indicating the signal is not fully reducible to surface lexical cues or off-the-shelf sentence embeddings. The effect holds across Llama and Qwen models, with layer-wise analysis showing architecture-specific peaks. These results provide preliminary evidence that LLM representations linearly encode this controlled framing distinction, while leaving open broader questions about human-grounded implicit bias.
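The abstract specifies the probing pipeline only at a high level. The sketch below illustrates one plausible instantiation: extract a single layer's hidden state for each prompt, fit a logistic-regression probe, and score AUROC on prompts from a different generator. The checkpoint name, probe layer, last-token pooling, and probe hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a linear-probe + cross-generator-transfer setup.
# Assumptions (not from the paper): checkpoint, probe layer, last-token
# pooling, and default logistic-regression hyperparameters.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

MODEL_NAME = "meta-llama/Llama-3.1-8B"  # hypothetical checkpoint choice
LAYER = 16                              # hypothetical probe layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_features(prompts, layer=LAYER):
    """Last-token hidden state at one layer, one vector per prompt."""
    feats = []
    with torch.no_grad():
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt")
            out = model(**ids)
            # hidden_states[0] is the embedding layer; `layer` indexes
            # the output of the layer-th transformer block.
            feats.append(out.hidden_states[layer][0, -1].float().numpy())
    return feats

def transfer_auroc(train_prompts, train_labels, test_prompts, test_labels):
    """Fit on prompts from one generator; score AUROC on another's prompts."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_features(train_prompts), train_labels)
    scores = probe.predict_proba(hidden_features(test_prompts))[:, 1]
    return roc_auc_score(test_labels, scores)
```

In this framing, the bag-of-words baseline would replace `hidden_features` with token-count vectors over the same train/test split, so any probe advantage under transfer reflects the representation rather than shared surface vocabulary.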
Submission Number: 58