High-Dimension Human Value Representation in Large Language Models

ACL ARR 2024 June Submission 413 Authors

Published: 11 Jun 2024 (modified: 06 Aug 2024)
License: CC BY 4.0
Abstract: The widespread application of Large Language Models (LLMs) across diverse tasks and fields has made aligning these models with human values and preferences a necessity. Given the variety of approaches to human value alignment, such as Reinforcement Learning from Human Feedback (RLHF), constitutional learning, and safety fine-tuning, there is an urgent need to understand the scope and nature of the human values injected into these LLMs before their deployment and adoption. We propose UniVaR, a high-dimensional neural representation of symbolic human value distributions in LLMs that is orthogonal to model architecture and training data. UniVaR is a continuous and scalable representation, learned in a self-supervised manner from the value-relevant outputs of 8 LLMs and evaluated on 15 open-source and commercial LLMs. Through UniVaR, we visualize and explore how LLMs prioritize different values across 25 languages and cultures, shedding light on the complex interplay between human values and language modeling.
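For intuition only, below is a minimal, hypothetical sketch of how a self-supervised value representation of this kind could be learned: pairs of value-relevant answers elicited from the same model context are treated as positive views, and a small projector over generic text embeddings is trained with a standard contrastive (NT-Xent) objective. The module names, dimensions, pairing scheme, and toy data are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of learning a value-representation space.
# NOT the UniVaR code: the pairing scheme and dimensions are assumptions.
# We assume value-relevant answers have already been embedded by some
# sentence encoder into `base_embs`, where rows 2i and 2i+1 are two
# answers elicited from the same LLM/value context (a positive pair).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueProjector(nn.Module):
    """Maps generic text embeddings into a (unit-normalized) value space."""
    def __init__(self, dim_in: int = 384, dim_out: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, 256), nn.ReLU(), nn.Linear(256, dim_out)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # project onto unit sphere

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Standard NT-Xent contrastive loss over two batches of paired views."""
    z = torch.cat([z1, z2], dim=0)                 # (2N, d), unit-norm rows
    sim = z @ z.t() / tau                          # cosine similarities
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))     # exclude self-similarity
    # Row i's positive is row i+N (and vice versa).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy stand-in data: 64 positive pairs of 384-d sentence embeddings.
base_embs = torch.randn(128, 384)
proj = ValueProjector()
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
for step in range(100):
    z = proj(base_embs)
    loss = nt_xent(z[0::2], z[1::2])               # even/odd rows are pairs
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final contrastive loss: {loss.item():.3f}")
```

Under this assumed setup, the learned embeddings could then be projected to 2D (e.g., with t-SNE or UMAP) to visualize how models and languages cluster by the values they express, in the spirit of the exploration the abstract describes.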
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Human-Centered NLP, Multilingualism and Cross-Lingual NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English (eng); Chinese (zho); Korean (kor); Japanese (jpn); German (deu); Finnish (fin); Swedish (swe); French (fra); Italian (ita); Portuguese (por); Spanish (spa); Thai (tha); Vietnamese (vie); Malay (zsm); Tagalog (tgl); Haitian Creole (hat); Quechua (quy); Russian (rus); Romanian (ron); Bulgarian (bul); Indonesian (ind); Arabic (arb); Swahili (swh); Hindi (hin); Persian (pes)
Submission Number: 413