Keywords: large language models, alignment, bias, stance elicitation, semantic spaces, black-box analysis, interpretability, evaluation
TL;DR: Stance elicitation lets us construct semantic spaces for LLMs, making biases, alignment inconsistencies, and stereotyping explicit.
Abstract: We present a stance-tensor method for visualizing the semantic spaces of large language models (LLMs).
The method constructs stance vectors from model responses to structured entity--policy queries and uses these vectors to derive low-dimensional representations of the underlying semantic structure, enabling direct comparison of generic descriptors with explicit rule-based specifications.
Across multiple state-of-the-art LLMs, the approach identifies consistent patterns: (i) a stable triangular configuration of U.S. political party anchors; (ii) close correspondence between party programs and philosophical traditions; (iii) clustering of generic normative terms in a consistent region associated with Rawlsian principles; (iv) expected placement of Pew political-typology groups; (v) coherent cross-national mapping of German parties into U.S. political space; (vi) a strong correlation between PCA-derived left--right scores and Manifesto Project RILE values; (vii) substantial inter-model variation in demographic stereotyping; and (viii) systematic divergences between generic and rule-based definitions of alignment and legal systems.
These results show that simple stance-based probes reveal stable and reproducible semantic structure in LLMs and provide a direct mechanism for identifying inconsistencies between default assumptions, explicit rule sets, and institutional frameworks.
Because these discrepancies form measurable error signals in the stance tensor, the same framework can be used not only for auditing but also for improving model alignment through targeted training.
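The construction described above can be sketched in a few lines: stack elicited stance values into an entity-by-policy matrix and project it to a low-dimensional space with PCA, where distances between generic descriptors and explicit anchors become directly comparable. The entity names, stance values, and the `pca_project` helper below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical stance matrix: rows are entity anchors, columns are policy
# queries; entries are elicited stances in [-1, 1]. Real values would come
# from structured entity--policy prompts to an LLM.
entities = ["Party A", "Party B", "Party C", "generic 'fair'"]
stances = np.array([
    [ 0.9,  0.8, -0.7,  0.6],
    [-0.8, -0.9,  0.7, -0.5],
    [ 0.1,  0.0,  0.2,  0.9],
    [ 0.2,  0.1,  0.3,  0.8],
])

def pca_project(X, k=2):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)               # center each policy column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                  # low-dimensional coordinates

coords = pca_project(stances)             # 2-D semantic space, shape (4, 2)

# Distances in this space compare a generic descriptor with explicit anchors,
# e.g. whether "generic 'fair'" sits nearer Party C than Party B.
d_to_C = np.linalg.norm(coords[3] - coords[2])
d_to_B = np.linalg.norm(coords[3] - coords[1])
```

With real model responses in place of the toy matrix, discrepancies such as `d_to_C` versus `d_to_B` are exactly the measurable error signals the abstract refers to.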
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15668