Keywords: Recursive self-improvement, Alignment drift, Regression risk, Long-horizon stability, AI Alignment
TL;DR: SAHOO presents a data-driven framework (GDI, CAR, regression safeguards) to measure, predict, and limit alignment drift during recursive self-improvement, enabling capability gains while preserving safety constraints.
Abstract: Recursive self-improvement is shifting from theory to practice: modern systems can critique, revise, and evaluate their outputs, yet iterative self-modification risks subtle alignment drift. We introduce \textbf{SAHOO}, a practical framework to monitor and control drift via three complementary safeguards: (i) the \emph{Goal Drift Index} (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii) \emph{constraint preservation} checks that enforce safety-critical invariants (e.g., syntactic correctness, non-hallucination); and (iii) \emph{regression-risk} quantification to flag when improvement cycles undo prior gains. Across 189 tasks in code generation, mathematical reasoning, and truthfulness, SAHOO produces substantial quality gains (e.g., +18.3\% on code, +16.8\% on reasoning) while fully preserving constraints in two domains and maintaining a low violation rate in truthfulness, with thresholds calibrated on a small validation set (18 tasks, 3 cycles). We further map the capability–alignment frontier, showing efficient early cycles but rising alignment costs later, and exposing domain-specific tensions (e.g., fluency vs.\ factuality). SAHOO thus renders alignment preservation during recursive self-improvement measurable, deployable, and systematically validated at scale.
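To make the GDI idea concrete, here is a minimal sketch of how a multi-signal drift index might combine per-signal scores into a single gated value. All names, weights, and the threshold below are illustrative assumptions for exposition, not SAHOO's actual learned detector or calibrated values.

```python
# Hypothetical sketch of a GDI-style multi-signal drift score.
# Signal names mirror the abstract (semantic, lexical, structural,
# distributional); the weights and threshold are invented placeholders.

def goal_drift_index(signals, weights):
    """Combine per-signal drift scores (each in [0, 1]) into one index."""
    total = sum(weights.values())
    return sum(weights[name] * signals[name] for name in weights) / total

# Example: drift measured for one self-improvement cycle.
signals = {"semantic": 0.12, "lexical": 0.30,
           "structural": 0.05, "distributional": 0.18}
weights = {"semantic": 0.4, "lexical": 0.2,
           "structural": 0.2, "distributional": 0.2}

gdi = goal_drift_index(signals, weights)   # 0.154 with these inputs
drifted = gdi > 0.25  # threshold would be calibrated on a validation set
```

In the paper's setting, the detector is learned and its thresholds are calibrated on a small validation set (18 tasks, 3 cycles); a fixed weighted sum like the above is only the simplest possible instance of the idea.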
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 124