HOW WELL CAN A LARGE LANGUAGE MODEL INFER THE VALUE REPRESENTATIONS EXPRESSED BY ANOTHER INSTANCE OF THE SAME MODEL?
Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: We study whether large language models can reliably recognize value-conditioned behavior expressed by other instances of the same model across distinct types of social interaction, framing this as a model-to-model representational alignment problem over human values. To this end, we introduce a Generator-Inquisitor framework in which one model (the Generator) writes a text description of a target value in the context of one of four relational dynamics (Communal Sharing, Equality Matching, Authority Ranking, or Market Pricing) and the other model (the Inquisitor) infers the underlying value from the Generator's text. Although value recognition was high across Gemini, GPT, Llama, and Mistral models, alignment accuracy varied systematically by relational domain and value dimension, with self-enhancement showing the most pronounced misalignment, especially in the Communal Sharing context. Together, these results show that value representations in LLMs are not abstract model attributes but emerge through domain-specific interactional contexts, motivating evaluation protocols that go beyond single-agent behavioral alignment. As with humans, this observation calls for a contextualized approach to the study of value alignment in machines.
Presenter: ~Maryam_Ghorbansabagh1
Submission Number: 103