From Instructions to Basic Human Values: A Survey of Alignment Goals for Big Models

ACL ARR 2024 June Submission 5679 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: As big models demonstrate remarkable performance across diverse tasks, concerns have been raised about their potential risks and social harms. Extensive efforts have been made to align big models with humans, so as to ensure their responsible development and to maximize human benefit. Nevertheless, the question of `what to align with' remains largely unexplored. It is critical to precisely define the objectives for big models to pursue, since aligning with inappropriate goals could lead to disaster, e.g., chatbots may produce abusive or biased content when merely instructed to interact freely. This paper conducts a comprehensive survey of different alignment goals, tracing their evolution to identify the most appropriate goal for big models. Specifically, we categorize existing goals into four levels: human instructions, human preferences, value principles, and basic values, revealing a progression from learning basic abilities to internalizing intrinsic value concepts. For each goal, we elaborate on its definition and limitations, how techniques are designed to achieve it, and how the resulting alignment is evaluated. Positing basic values as a promising goal, we discuss technical challenges and future research directions.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: large language model, value alignment, alignment goals, basic values
Contribution Types: Surveys
Languages Studied: English
Submission Number: 5679