Keywords: alignment, human-AI interaction, minimax, robustness, trustworthy AI
Abstract: We study an agent who combines her private information with recommendations from an informed but potentially misaligned adviser. The adviser observes a signal and, with known probability, reveals it truthfully; otherwise he can send an arbitrary message. We characterize the agent’s inference-and-action rule that delivers the maximal guaranteed payoff. Any optimal rule admits a trust-region representation in belief space: advice is taken at face value when it induces a posterior within the trust region, and otherwise the agent acts as if the posterior lay on the trust region’s boundary. We show that commitment has no value to the agent, and we derive thresholds on the truthfulness probability above which the adviser’s presence strictly benefits the agent.
Track: Long Paper
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 78