"There are no solutions, only trade-offs.'' Taking A Closer Look At Safety Data Annotations.

Published: 10 Oct 2024 · Last Modified: 15 Nov 2024 · Pluralistic-Alignment 2024 · CC BY 4.0
Keywords: nlp, alignment, reward models
TL;DR: We study the downstream safety effects of aggregation on multi-annotated datasets with demographically diverse participants.
Abstract: AI alignment, the final step in the training pipeline, ensures that large language models pursue desirable goals and values, improving helpfulness, reliability, and safety. Existing approaches typically rely on supervised learning algorithms with data labeled by human annotators, but sociodemographic and personal context shape how annotators label for alignment objectives. In safety alignment particularly, labels are often ambiguous, and the moral question of "What $\textit{should}$ an LLM do?" is even more perplexing and lacks a clear ground truth. We seek to understand the effects of aggregation on multi-annotated datasets with demographically diverse participants, and in particular its implications for safety under subjective preferences. This paper offers a quantitative and qualitative analysis of aggregation methods on safety data and their potential ramifications for alignment. Our results show that safety annotations are mutually contradictory and that existing strategies to reconcile these disagreements fail to remove the contradiction. Crucially, we find that annotator labels are sensitive to intersectional differences that existing aggregation methods erase. We additionally explore evaluation perspectives from social choice theory. Our findings suggest that social welfare metrics offer insight into the relative disadvantages imposed on minority groups.
Submission Number: 45
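As an illustration of the general ideas the abstract names (aggregating multi-annotator safety labels and evaluating the outcome with social-welfare-style metrics), the following minimal Python sketch aggregates hypothetical safety labels by majority vote and then compares a utilitarian (mean) view against a Rawlsian (worst-off group) view of per-group agreement. The data, group names, and metric choices here are assumptions made for demonstration only; they are not the paper's dataset, aggregation strategy, or welfare metrics.

```python
# Illustrative sketch (not the paper's code): majority-vote aggregation of
# safety labels and a simple per-group welfare comparison over made-up data.
from collections import Counter
from statistics import mean

# Hypothetical annotations: item -> list of (annotator_group, label),
# where label 1 = "unsafe", 0 = "safe".
annotations = {
    "prompt_1": [("group_A", 1), ("group_A", 1), ("group_B", 0), ("group_B", 0), ("group_B", 0)],
    "prompt_2": [("group_A", 1), ("group_A", 0), ("group_B", 1), ("group_B", 1), ("group_B", 1)],
    "prompt_3": [("group_A", 0), ("group_A", 0), ("group_B", 0), ("group_B", 1), ("group_B", 1)],
}

def majority_vote(labels):
    """Aggregate labels by simple majority (ties broken toward 'unsafe')."""
    counts = Counter(labels)
    return 1 if counts[1] >= counts[0] else 0

# Aggregate each item and track how often each group's annotators agree
# with the aggregated label (a toy notion of that group's "utility").
group_hits = {}  # group -> list of 0/1 agreement indicators
for item, votes in annotations.items():
    agg = majority_vote([label for _, label in votes])
    for group, label in votes:
        group_hits.setdefault(group, []).append(int(label == agg))

per_group_utility = {g: mean(hits) for g, hits in group_hits.items()}

# Two social-welfare views of the same aggregation outcome:
utilitarian = mean(per_group_utility.values())  # average welfare across groups
rawlsian = min(per_group_utility.values())      # welfare of the worst-off group

print("Per-group agreement with majority vote:", per_group_utility)
print("Utilitarian (mean) welfare:", round(utilitarian, 3))
print("Rawlsian (min) welfare:   ", round(rawlsian, 3))
```

In this toy example the minority group's annotations agree with the majority-vote label far less often, so the utilitarian average looks acceptable while the Rawlsian minimum exposes the disadvantage; this is the kind of gap that per-group, welfare-style evaluation can surface and that a single aggregate score can hide.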
