Sociolinguistic Simulacra: Interactions Between Language and Attitudes in Fine-Tuned Language Models

ACL ARR 2025 May Submission 4197 Authors

19 May 2025 (modified: 03 Jul 2025) · CC BY 4.0
Abstract: Recent advances in large language models have demonstrated their capacity to generate human-like text across a range of applications. However, aligning these models to specific behavioral preferences, such as political neutrality or desirable personality traits, remains a challenge. Current alignment approaches prominently include reward-based post-training techniques such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), whose effects depend on models' inductive biases in ways that are important but poorly understood. In this paper, we investigate the effects of low-level linguistic features in DPO preference data on a language model's higher-level behaviors, including its personality traits and self-reported demographic attributes. Using DPO, we post-train models on datasets of paired English texts with regionally marked differences in orthography and usage, and assess the resulting models' personality traits using established frameworks, with the aim of providing insight into how cultural and linguistic inputs shape language model behavior.
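For concreteness, below is a minimal sketch of the setup the abstract describes: the DPO objective (following Rafailov et al., 2023) together with one preference pair whose completions differ only in regionally marked orthography and usage. The prompt, the completions, and the choice of which variant counts as "chosen" are hypothetical illustrations, not items from the paper's datasets.

```python
import torch
import torch.nn.functional as F

# Hypothetical preference pair in the paired-English-text format the
# abstract describes: the completions differ only in regionally marked
# orthography and usage (British- vs. American-marked). Which variant
# is "chosen" would depend on the target variety being trained toward.
pair = {
    "prompt": "Describe your favourite season.",
    "chosen": "I organise my year around autumn, when the colours are at their best.",
    "rejected": "I organize my year around fall, when the colors are at their best.",
}

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023).

    Each argument holds the summed per-token log-probabilities of the
    chosen or rejected completion, shape (batch,), under either the
    trainable policy or the frozen reference model; beta scales the
    implicit KL penalty toward the reference model.
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Maximize sigmoid(beta * (policy margin minus reference margin)),
    # i.e., push the policy to prefer "chosen" more than the reference does.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In practice such training is usually run through an off-the-shelf trainer (e.g., TRL's DPOTrainer, which consumes pairs in exactly this prompt/chosen/rejected format); the paper does not specify its tooling, so that is an assumption.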
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: fine-tuning, dialects and language varieties, adversarial attacks/examples/training, model bias/fairness evaluation, values and culture, safety and alignment
Languages Studied: English
Submission Number: 4197