Abstract: As big models demonstrate remarkable performance across diverse tasks, concerns have been raised about their potential risks and social harms. Extensive efforts have been devoted to aligning big models with humans to ensure their responsible development and to maximize their benefit to humanity. Nevertheless, the basic question of "what to align with" remains largely unexplored. It is critical to precisely define the objectives that big models should pursue, since aligning with inappropriate goals can lead to disaster; for example, a chatbot may produce abusive or biased content when it merely follows user instructions to interact freely. This paper conducts a comprehensive survey of different alignment goals, tracing their evolution to identify the most appropriate goal for big models. Specifically, we categorize existing alignment goals into four levels: human instructions, human preferences, value principles, and basic values, revealing a progression from basic abilities to higher-level value concepts. For each goal, we further elaborate on its definition, how it is represented, and how it is evaluated. Positing basic values as a promising goal, we discuss the remaining challenges and future research directions.
Paper Type: long
Research Area: Ethics, Bias, and Fairness
Contribution Types: Surveys
Languages Studied: English, Multilingual