Abstract: A central approach to AI-safety research has been to generate aligned AI systems: i.e., systems that do not deceive users and that yield actions or recommendations humans would judge as consistent with their intentions and goals. Here we argue that truthful AIs aligned solely to human intent are insufficient, and that preserving humans' long-term agency may be a more robust standard, one that may need to be separated out and explicitly optimized for. We discuss the science of intent and control, examine how human intent can be manipulated, and provide a formal definition of agency-preserving AI-human interactions that focuses on forward-looking, explicit agency evaluations. Our work identifies a novel pathway for human harm in AI-human interactions and proposes solutions to this challenge.
Submission Number: 6069