Probing Political Ideology in Large Language Models: How Latent Political Representations Generalize Across Tasks

ACL ARR 2025 May Submission5235 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large language models (LLMs) encode rich internal representations of political ideology, but it remains unclear how these representations contribute to model decision-making and how the latent dimensions they capture interact with one another. In this work, we investigate whether ideological directions identified via linear probes—specifically, those predicting DW-NOMINATE scores from attention head activations—influence model behavior in downstream political tasks. We apply inference-time interventions to steer a decoder-only transformer along the learned ideological directions and evaluate their effect on three tasks: political bias detection, voting preference simulation, and bias neutralization via rewriting. Our results show that the learned ideological representations generalize well to bias detection but less well to voting simulation, suggesting that political ideology is encoded in multiple, partially disentangled latent structures. We also observe asymmetries in how interventions affect liberal versus conservative outputs, raising concerns about pretraining-induced bias and post-training alignment effects. This work highlights the risks of using biased LLMs for politically sensitive tasks, and calls for deeper investigation into the interaction of social dimensions in model representations, as well as methods for steering them toward fairer, more transparent behavior.
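The abstract describes a two-step recipe: fit a linear probe from attention-head activations to DW-NOMINATE scores, then reuse the probe's direction for an inference-time steering intervention. The sketch below illustrates that recipe under stated assumptions; it is not the authors' code. The ridge probe, the per-head dimensionality, the choice of hooked module, and the steering strength `alpha` are all illustrative assumptions, and the placeholder data stand in for cached activations of texts by legislators with known scores.

```python
# Minimal sketch (not the submission's implementation): linear probe from attention-head
# activations to DW-NOMINATE scores, then an additive inference-time intervention along
# the probe's direction. Shapes, layer/head choice, and alpha are illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import Ridge

HEAD_DIM = 64        # assumed per-head activation size
N_EXAMPLES = 500     # assumed probe training-set size

# --- 1. Probe: head activations -> DW-NOMINATE score ---------------------------------
# In practice these would be cached activations paired with real DW-NOMINATE scores;
# random placeholders keep the sketch self-contained and runnable.
acts = np.random.randn(N_EXAMPLES, HEAD_DIM)
dw_nominate = np.random.uniform(-1.0, 1.0, size=N_EXAMPLES)

probe = Ridge(alpha=1.0).fit(acts, dw_nominate)

# Unit-norm ideological direction in this head's activation space.
direction = torch.tensor(probe.coef_, dtype=torch.float32)
direction = direction / direction.norm()

# --- 2. Inference-time intervention: shift the head output along the probe direction ---
def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * direction to a module's output tensor."""
    def hook(module, inputs, output):
        # Assumes the hooked module returns a tensor whose last dim matches HEAD_DIM.
        return output + alpha * direction.to(output.device, output.dtype)
    return hook

# Toy stand-in for one attention head's output; with a real decoder-only transformer
# the hook would be registered on the corresponding per-head module instead.
head_module = nn.Linear(HEAD_DIM, HEAD_DIM)
handle = head_module.register_forward_hook(make_steering_hook(direction, alpha=3.0))

x = torch.randn(1, 8, HEAD_DIM)   # (batch, seq_len, head_dim)
steered = head_module(x)          # output shifted along the learned ideological direction
handle.remove()                   # detach the hook to restore unmodified behavior
```

Under DW-NOMINATE's usual sign convention, a positive `alpha` pushes generations toward the probe's positive (conservative) pole and a negative `alpha` toward the liberal pole; which pole is which in practice depends on the fitted probe's sign.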
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: probing, language/cultural bias analysis, NLP tools for social analysis, model bias/fairness evaluation, knowledge tracing/discovering/inducing, model editing
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 5235