ProgressGym: Alignment with a Millennium of Moral Progress

Published: 26 Sept 2024, Last Modified: 13 Nov 2024 · NeurIPS 2024 Datasets and Benchmarks Track, Spotlight · CC BY-NC 4.0
Keywords: Progress Alignment, AI Alignment, Large Language Models
TL;DR: We introduce progress alignment as a solution to risks of AI-induced value lock-in, and build the ProgressGym experimental framework to facilitate the emulation of moral progress in alignment algorithms.
Abstract: Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce **progress alignment** as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce [**ProgressGym**](https://github.com/PKU-Alignment/ProgressGym), an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 [historical LLMs](https://huggingface.co/collections/PKU-Alignment/progressgym-666735fcf3e4efa276226eaa), ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present *lifelong* and *extrapolative* algorithms as baseline methods of progress alignment, and build an [open leaderboard](https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard) soliciting novel algorithms and challenges.
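The abstract contrasts the *lifelong* baseline (tracking evolving values, as in PG-Follow) with the *extrapolative* baseline (anticipating moral progress, as in PG-Predict). A minimal sketch of this distinction, using toy century-indexed value vectors — note that the vectors, function names, and the linear-extrapolation rule here are illustrative assumptions, not the framework's actual API:

```python
# Illustrative sketch only: ProgressGym's real interface is not shown in the
# abstract. Each list below is a hypothetical stand-in for a learned value
# representation at one century.

def lifelong_estimate(history):
    """Lifelong-style baseline: follow the latest observed values (PG-Follow)."""
    return history[-1]

def extrapolative_estimate(history):
    """Extrapolative-style baseline: linearly project one step ahead (PG-Predict)."""
    if len(history) < 2:
        return history[-1]
    prev, last = history[-2], history[-1]
    # Continue the most recent trend: last + (last - prev), per dimension.
    return [l + (l - p) for p, l in zip(prev, last)]

# Toy "value trajectory" across three consecutive centuries (made-up numbers).
centuries = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.3, 0.7],
]

print(lifelong_estimate(centuries))       # tracks current values
print(extrapolative_estimate(centuries))  # anticipates the next step
```

A lifelong method only ever matches the present, while an extrapolative method commits to a guess about where values are heading; the PG-Predict challenge scores exactly that kind of anticipation.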
Flagged For Ethics Review: true
Submission Number: 2433