Keywords: assistance games, assistants, alignment, interaction
Abstract: Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe the user's goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both accurately modeling human users' behavior and determining optimal actions in uncertain sequential decision-making problems. We tackle these challenges by introducing a deep reinforcement learning (RL) algorithm called AssistanceZero for solving assistance games and applying it to a Minecraft-based assistance game with over $10^{400}$ possible goals. We show that an AssistanceZero assistant effectively assists simulated humans in achieving unseen goals and outperforms assistants trained with imitation learning and model-free RL. Our results suggest that assistance games are more tractable than previously thought, and that they are an effective framework for assistance at scale.
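For readers unfamiliar with the formalism, one common way to state the assistance-game model referenced in the abstract (following the cooperative inverse reinforcement learning literature; the notation below is illustrative, not taken from this submission) is as a two-player sequential game with a goal parameter hidden from the assistant:

\[
  \mathcal{M} \;=\; \bigl\langle \mathcal{S},\, \mathcal{A}^{H},\, \mathcal{A}^{R},\, T,\, \Theta,\, r,\, P_0,\, \gamma \bigr\rangle,
\]

where the human $H$ and the assistant $R$ act jointly under the transition kernel $T(s' \mid s, a^{H}, a^{R})$, and both players share the reward

\[
  r(s, a^{H}, a^{R};\, \theta), \qquad \theta \in \Theta,
\]

with the goal parameter $\theta \sim P_0(\theta)$ observed only by the human. The assistant must therefore infer $\theta$ from the human's behavior while acting, which is what distinguishes this setting from standard RLHF-style reward optimization.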
Submission Number: 133