TL;DR: We study when an AI assistant would have an incentive to interfere with a human's observations.
Abstract: We study partially observable assistance games (POAGs), a model of the human-AI value alignment problem that allows the human and the AI assistant to have partial observations. Motivated by concerns of AI deception, we study a qualitatively new phenomenon made possible by partial observability: would an AI assistant ever have an incentive to interfere with the human's observations? First, we prove that sometimes an optimal assistant must take observation-interfering _actions_, even when the human is playing optimally, and even when there are otherwise-equivalent actions available that do not interfere with observations. Although this result appears to contradict the classic theorem from single-agent decision making that the value of perfect information is nonnegative, we resolve the apparent contradiction by developing a notion of interference defined on entire _policies_. This can be viewed as an extension of the classic result that the value of perfect information is nonnegative to the cooperative multiagent setting. Second, we prove that if the human simply makes decisions based on their immediate outcomes, the assistant might need to interfere with observations as a way to query the human's preferences. We show that this incentive for interference goes away if the human is playing optimally, or if we introduce a communication channel for the human to communicate their preferences to the assistant. Third, we show that if the human acts according to the Boltzmann model of irrationality, this can create an incentive for the assistant to interfere with observations. Finally, we use an experimental model to analyze tradeoffs faced by the AI assistant in practice when considering whether or not to take observation-interfering actions.
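For readers unfamiliar with the two standard results the abstract invokes, the sketch below states the classic single-agent value-of-perfect-information inequality and the usual Boltzmann (softmax) model of human irrationality in generic notation. The symbols here ($U$, $\theta$, $\beta$, $Q_H$) are illustrative and may differ from the formalization used in the paper.

```latex
% Value of perfect information (single-agent, classic result):
% deciding after observing O can never lower the optimal expected utility.
\max_{a}\, \mathbb{E}_{\theta}\!\left[ U(a, \theta) \right]
  \;\le\;
  \mathbb{E}_{o \sim O}\!\left[ \max_{a}\, \mathbb{E}_{\theta}\!\left[ U(a, \theta) \mid O = o \right] \right]

% Boltzmann model of (ir)rationality: the human chooses actions with probability
% exponential in their value, with rationality parameter \beta \ge 0.
\pi_H(a \mid s) \;=\; \frac{\exp\!\big(\beta\, Q_H(s, a)\big)}{\sum_{a'} \exp\!\big(\beta\, Q_H(s, a')\big)}
```

As the abstract notes, the analogous inequality need not rule out observation-interfering _actions_ by the assistant in the cooperative two-agent setting, which is why the paper develops a policy-level notion of interference.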
Lay Summary: When would an AI assistant interfere with a human’s observations? It’s clear that a misaligned assistant might interfere with observations in order to deceive. But what about a perfectly aligned assistant?
We identify three distinct reasons for even a perfectly aligned AI assistant to interfere with human observations. Interfering with observations sometimes helps an assistant communicate its own private observations to humans, query human preferences, and improve the decision making of irrational humans.
Our results complicate the picture, suggesting that not all observation interference is inherently bad. Our work is theoretical, laying a foundation for future work to understand and address the nuanced issue of observation interference in practice.
Primary Area: Theory->Game Theory
Keywords: assistance games, AI alignment, observation interference, partial observability, partially observable assistance games
Submission Number: 13868