Auto-Guideline Alignment: Probing and Simulating Human Ideological Preferences in LLMs via Prompt Engineering
Keywords: Large Language Models, Human Value Alignment, Human-Label Replacement, Prompt Engineering, Ideological Bias, LLM-as-Judge, Political Stance Modeling
TL;DR: Using only a small synthetic preference set and no model fine-tuning, Auto-Guideline Alignment replaces costly human labeling by iteratively refining textual guidelines that uncover and steer LLM ideological stances.
Abstract: Aligning large language models (LLMs) with human values usually requires expensive reinforcement learning from human feedback.
We introduce Auto-Guideline Alignment (AGA), a prompt-only framework that uncovers, audits, and steers *hidden* ideological preferences by treating concise, human-readable guidelines as transparent reward proxies, without any parameter updates.
To evaluate AGA, we use **GPT-4.1** to generate a dataset of 600 left/right political dilemmas covering 30 topics (five domains $\times$ six subdomains).
Three experiments show that: (1) LLMs exhibit a consistent left-leaning bias; (2) AGA aligns models to all $2^5$ domain-level ideology mixtures, with degraded performance under cross-domain conflict; and (3) intra-domain stance fragmentation leads to unstable alignment and reduced accuracy.
Overall, AGA delivers scalable, transparent, and reproducible value alignment, replacing costly human labeling with explicit rules and iterative self-evaluation.
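The abstract describes AGA as an iterative, prompt-only loop: judge dilemmas under the current guideline, collect disagreements with the target stances, and ask the model to rewrite the guideline. The sketch below is purely illustrative and not the authors' implementation; the `query_llm` helper, prompt wording, and stopping rule are assumptions standing in for whatever LLM client and prompts the paper actually uses.

```python
"""Illustrative sketch of a prompt-only Auto-Guideline Alignment loop.

Assumptions: `query_llm` is a placeholder for any chat-completion call
(e.g., GPT-4.1); the preference set is a list of (dilemma, target_stance)
pairs with stances in {"left", "right"}.
"""


def query_llm(prompt: str) -> str:
    """Placeholder for an LLM API call; plug in your own client here."""
    raise NotImplementedError("connect this to an actual LLM backend")


def judge(dilemma: str, guideline: str) -> str:
    """Ask the model to pick a stance for one dilemma under the guideline."""
    prompt = (
        f"Guideline:\n{guideline}\n\n"
        f"Dilemma:\n{dilemma}\n\n"
        "Answer with exactly one word: left or right."
    )
    return query_llm(prompt).strip().lower()


def refine_guideline(guideline: str, errors: list[tuple[str, str]]) -> str:
    """Ask the model to rewrite the guideline to fix observed disagreements."""
    error_text = "\n".join(
        f"- Dilemma: {d}\n  Target stance: {t}" for d, t in errors
    )
    prompt = (
        f"Current guideline:\n{guideline}\n\n"
        "The guideline led to the wrong stance on these items:\n"
        f"{error_text}\n\n"
        "Rewrite the guideline (concise, human-readable) so that the target "
        "stances would be chosen. Return only the new guideline."
    )
    return query_llm(prompt).strip()


def auto_guideline_alignment(preference_set, guideline="Answer neutrally.", rounds=5):
    """Iteratively refine a textual guideline against a small preference set."""
    for _ in range(rounds):
        # Score the current guideline and keep only the disagreements.
        errors = [(d, t) for d, t in preference_set if judge(d, guideline) != t]
        if not errors:  # guideline already reproduces all target stances
            break
        guideline = refine_guideline(guideline, errors)
    return guideline
```

In this reading, the refined guideline text itself serves as the transparent reward proxy: it can be inspected, edited, or swapped without touching model parameters.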
Submission Number: 59