Probing Political Ideology in Large Language Models: How Latent Political Representations Generalize Across Tasks

ACL ARR 2025 May Submission5235 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large language models (LLMs) encode rich internal representations of political ideology, but it remains unclear how these representations contribute to model decision-making and how the latent dimensions they capture interact with one another. In this work, we investigate whether ideological directions identified via linear probes—specifically, those predicting DW-NOMINATE scores from attention head activations—influence model behavior in downstream political tasks. We apply inference-time interventions to steer a decoder-only transformer along the learned ideological directions and evaluate their effect on three tasks: political bias detection, voting preference simulation, and bias neutralization via rewriting. Our results show that the learned ideological representations generalize well to bias detection but less well to voting simulation, suggesting that political ideology is encoded in multiple, partially disentangled latent structures. We also observe asymmetries in how interventions affect liberal versus conservative outputs, raising concerns about pretraining-induced bias and post-training alignment effects. This work highlights the risks of using biased LLMs for politically sensitive tasks, and calls for deeper investigation into the interaction of social dimensions in model representations, as well as methods for steering them toward fairer, more transparent behavior.
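The abstract describes a two-step recipe: fit a linear probe from attention-head activations to DW-NOMINATE scores, then reuse the probe's direction for an inference-time steering intervention. The sketch below illustrates that recipe under stated assumptions; it is not the authors' code. The ridge probe, the per-head dimensionality, the choice of hooked module, and the steering strength `alpha` are all illustrative assumptions, and the placeholder data stand in for cached activations of texts by legislators with known scores.

```python
# Minimal sketch (not the submission's implementation): linear probe from attention-head
# activations to DW-NOMINATE scores, then an additive inference-time intervention along
# the probe's direction. Shapes, layer/head choice, and alpha are illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge

HEAD_DIM = 64        # assumed per-head activation size
N_EXAMPLES = 500     # assumed probe training-set size

# --- 1. Probe: head activations -> DW-NOMINATE score ---------------------------------
# In practice these would be cached activations paired with real DW-NOMINATE scores;
# random placeholders keep the sketch self-contained and runnable.
acts = np.random.randn(N_EXAMPLES, HEAD_DIM)
dw_nominate = np.random.uniform(-1.0, 1.0, size=N_EXAMPLES)

probe = Ridge(alpha=1.0).fit(acts, dw_nominate)

# Unit-norm ideological direction in this head's activation space.
direction = torch.tensor(probe.coef_, dtype=torch.float32)
direction = direction / direction.norm()

# --- 2. Inference-time intervention: shift the head output along the probe direction ---
def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * direction to a module's output tensor."""
    def hook(module, inputs, output):
        # Assumes the hooked module returns a tensor whose last dim matches HEAD_DIM.
        return output + alpha * direction.to(output.device, output.dtype)
    return hook

# Toy stand-in for one attention head's output; with a real decoder-only transformer
# the hook would be registered on the corresponding per-head module instead.
head_module = nn.Linear(HEAD_DIM, HEAD_DIM)
handle = head_module.register_forward_hook(make_steering_hook(direction, alpha=3.0))

x = torch.randn(1, 8, HEAD_DIM)   # (batch, seq_len, head_dim)
steered = head_module(x)          # output shifted along the learned ideological direction
handle.remove()                   # detach the hook to restore unmodified behavior
```

Under DW-NOMINATE's usual sign convention, a positive `alpha` pushes generations toward the probe's positive (conservative) pole and a negative `alpha` toward the liberal pole; which pole is which in practice depends on the fitted probe's sign.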
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, language/cultural bias analysis, NLP tools for social analysis, model bias/fairness evaluation, knowledge tracing/discovering/inducing, model editing
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 5235