Superalignment with Dynamic Human Values

Florian Mai; David Kaczér; Nicholas Kluge Corrêa; Lucie Flek

Published: 06 Mar 2025, Last Modified: 05 May 2025

Venue: ICLR 2025 Bi-Align Workshop (Poster)

License: CC BY 4.0

Keywords: scalable oversight, dynamic values, alignment, reasoning models

TL;DR: We present a roadmap for a scalable oversight algorithm that accounts for evolving human values.

Abstract: Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We argue that this generalization needs to be measured and propose ways to improve it in the future.
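
The decomposition idea in the abstract can be illustrated with a toy sketch. It is not the authors' algorithm: in the proposed framework a trained reasoning model produces the subtasks, whereas here they are hand-written. The `Task` class, the scalar `difficulty`, the `HUMAN_LEVEL` threshold, and the `judge` callback are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    name: str
    difficulty: int  # hypothetical scalar; higher means harder to oversee
    subtasks: List["Task"] = field(default_factory=list)


HUMAN_LEVEL = 3  # hypothetical threshold for what a human can still evaluate


def decompose(task: Task) -> List[Task]:
    """Recursively collect subtasks that are easy enough for human guidance."""
    if task.difficulty <= HUMAN_LEVEL:
        return [task]
    if not task.subtasks:
        raise ValueError(f"{task.name!r} is too hard and has no decomposition")
    leaves: List[Task] = []
    for sub in task.subtasks:
        leaves.extend(decompose(sub))
    return leaves


def all_parts_approved(task: Task, judge: Callable[[Task], bool]) -> bool:
    """True if the human judge approves every human-level subtask.

    Treating this as evidence that the complete solution is aligned is
    exactly the part-to-complete generalization hypothesis; it is an
    assumption to be measured, not a guarantee.
    """
    return all(judge(leaf) for leaf in decompose(task))


plan = Task("write report", 7, [
    Task("outline", 2),
    Task("analysis", 5, [Task("collect data", 3), Task("summarize", 1)]),
])
leaf_names = [t.name for t in decompose(plan)]
```

Because `judge` is passed in as an argument, it can be replaced as human values change without touching the decomposition logic, which is one simple way to picture oversight that tracks dynamic values.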

Submission Type: Short Paper (4 Pages)

Archival Option: This is an archival submission

Presentation Venue Preference: ICLR 2025

Submission Number: 54
