From Instructions to Basic Human Values: A Survey of Alignment Goals for Big Models

ACL ARR 2024 June Submission 5679 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: As big models demonstrate remarkable performance across diverse tasks, concerns have been raised about their potential risks and social harms. Extensive efforts have been made to align big models with humans, so as to ensure their responsible development and to maximize human benefit. Nevertheless, the question of `what to align with' remains largely unexplored. It is critical to precisely define the objectives for big models to pursue, since aligning with inappropriate goals could lead to disaster, e.g., chatbots may produce abusive or biased content when merely instructed to interact freely. This paper conducts a comprehensive survey of different alignment goals, tracing their evolution to identify the most appropriate goal for big models. Specifically, we categorize existing goals into four levels: human instructions, human preferences, value principles, and basic values, revealing a progression from learning basic abilities to internalizing intrinsic value concepts. For each goal, we elaborate on its definition and limitations, how techniques are designed to achieve it, and how the resulting alignment is evaluated. Positing basic values as a promising goal, we discuss technical challenges and future research directions.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: large language model, value alignment, alignment goals, basic values
Contribution Types: Surveys
Languages Studied: English
Submission Number: 5679