Abstract: To assess the ethical risks associated with Large Language Models (LLMs), researchers have proposed various datasets to analyze the models' inclinations towards values. These datasets typically involve surveys and psychometric tests that require short-form responses from the LLMs. In this paper, we investigate the extent to which the value preferences estimated from these benchmarks align with downstream applications involving long-form generations. Since the goal of alignment is to instill a consistent set of values and principles in the models, we analyze its impact in this experiment on 5 LLMs: llama3-8b, gemma2-9b, mistral-7b, qwen2-7b and olmo-7b. Our analysis reveals that while alignment can improve the consistency between value preferences estimated from benchmarks and those expressed in long-form responses, the correlation remains weak, indicating a discrepancy between the preferences exhibited in different applications. Furthermore, the value preferences expressed in long-form generations can vary significantly across generations obtained by temperature sampling. Finally, we explore the connection between the models' proficiency in generating specific and diverse value-laden arguments and their value preferences. Empirical results demonstrate that for highly preferred values, most models generate less specific but more diverse arguments.
Paper Type: Long
Research Area: Computational Social Science and Cultural Analytics
Research Area Keywords: Value Alignment, Ethical Dilemma, Value Preferences, Consistency
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8001
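To make the consistency analysis described in the abstract more concrete, the sketch below illustrates one plausible way to compare value preferences estimated from a short-form benchmark against those estimated from long-form generations (via Spearman rank correlation), and to quantify how much a preference score varies across temperature-sampled generations. This is not the authors' code; the value names and all scores are hypothetical placeholders.

```python
# Illustrative sketch only: the value labels and scores below are invented
# for demonstration and do not come from the paper.
import statistics
from scipy.stats import spearmanr

values = ["self-direction", "benevolence", "security", "power", "achievement"]
benchmark_scores = [0.82, 0.75, 0.60, 0.31, 0.55]  # from short-form survey items
longform_scores  = [0.70, 0.78, 0.52, 0.45, 0.40]  # from long-form generations

# Rank correlation between the two preference estimates.
rho, p_value = spearmanr(benchmark_scores, longform_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# Variation of one value's preference score across temperature-sampled
# long-form generations of the same prompt (hypothetical samples).
sampled_scores = [0.70, 0.41, 0.66, 0.35, 0.58]
print("std across temperature samples:", round(statistics.stdev(sampled_scores), 3))
```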