Auto-Guideline Alignment: Probing and Simulating Human Ideological Preferences in LLMs via Prompt Engineering
Keywords: Large Language Models, Human Value Alignment, Human-Label Replacement, Prompt Engineering, Ideological Bias, LLM-as-Judge, Political Stance Modeling
TL;DR: Using only a small synthetic preference set and no model fine-tuning, Auto-Guideline Alignment replaces costly human labeling by iteratively refining textual guidelines that uncover and steer LLM ideological stances.
Abstract: Aligning large language models (LLMs) with human values usually requires expensive reinforcement learning from human feedback.
We introduce Auto-Guideline Alignment (AGA), a prompt-only framework that uncovers, audits, and steers *hidden* ideological preferences by treating concise, human-readable guidelines as transparent reward proxies, without any parameter updates.
To evaluate AGA, we use **GPT-4.1** to generate a dataset of 600 left/right political dilemmas covering 30 topics (five domains $\times$ six subdomains).
Three experiments show that: (1) LLMs exhibit a consistent left-leaning bias; (2) AGA aligns models to all $2^5$ domain-level ideology mixtures, with degraded performance under cross-domain conflict; and (3) intra-domain stance fragmentation leads to unstable alignment and reduced accuracy.
Overall, AGA delivers scalable, transparent, and reproducible value alignment, replacing costly human labeling with explicit rules and iterative self-evaluation.
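The abstract describes AGA as an iterative, prompt-only loop: judge dilemmas under the current guideline, collect disagreements with the target stances, and ask the model to rewrite the guideline. The sketch below is purely illustrative and not the authors' implementation; the `query_llm` helper, prompt wording, and stopping rule are assumptions standing in for whatever LLM client and prompts the paper actually uses.

```python
"""Illustrative sketch of a prompt-only Auto-Guideline Alignment loop.

Assumptions: `query_llm` is a placeholder for any chat-completion call
(e.g., GPT-4.1); the preference set is a list of (dilemma, target_stance)
pairs with stances in {"left", "right"}.
"""


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; plug in your own client here."""
    raise NotImplementedError("connect this to an actual LLM backend")


def judge(dilemma: str, guideline: str) -> str:
    """Ask the model to pick a stance for one dilemma under the guideline."""
    prompt = (
        f"Guideline:\n{guideline}\n\n"
        f"Dilemma:\n{dilemma}\n\n"
        "Answer with exactly one word: left or right."
    )
    return query_llm(prompt).strip().lower()


def refine_guideline(guideline: str, errors: list[tuple[str, str]]) -> str:
    """Ask the model to rewrite the guideline to fix observed disagreements."""
    error_text = "\n".join(
        f"- Dilemma: {d}\n  Target stance: {t}" for d, t in errors
    )
    prompt = (
        f"Current guideline:\n{guideline}\n\n"
        "The guideline led to the wrong stance on these items:\n"
        f"{error_text}\n\n"
        "Rewrite the guideline (concise, human-readable) so that the target "
        "stances would be chosen. Return only the new guideline."
    )
    return query_llm(prompt).strip()


def auto_guideline_alignment(preference_set, guideline="Answer neutrally.", rounds=5):
    """Iteratively refine a textual guideline against a small preference set."""
    for _ in range(rounds):
        # Score the current guideline and keep only the disagreements.
        errors = [(d, t) for d, t in preference_set if judge(d, guideline) != t]
        if not errors:  # guideline already reproduces all target stances
            break
        guideline = refine_guideline(guideline, errors)
    return guideline
```

In this reading, the refined guideline text itself serves as the transparent reward proxy: it can be inspected, edited, or swapped without touching model parameters.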
Submission Number: 59