Failure Modes in AI Retraining Dynamics

Kiarash Banihashem; Natalie Collina; Nicole Immorlica; Brendan Lucier; Aleksandrs Slivkins

Published: 07 Jun 2026, Last Modified: 11 Jun 2026

Venue: ICML 2026 Workshop

License: CC BY 4.0

Keywords: AI retraining, repeated strategic interaction, bandit feedback

Abstract: Modern AI systems are increasingly retrained on data generated through interaction with users. Three forces are at play: (i) users who strategically adapt their behavior, (ii) a prompting interface that obscures user intent, and (iii) the fact that AI is typically retrained "greedily," ignoring exploration-exploitation tradeoffs. We ask whether these dynamics lead to poor outcomes. We study a stylized model, focusing on the "nice" case in which the AI and the users have aligned incentives. We identify two distinct failure modes. First, the system may fail to converge to an optimal Nash equilibrium (of the relevant stage game) due to limited exploration, instead stabilizing at a suboptimal outcome region. This mode is ubiquitous: it occurs with positive probability for every problem instance. Second, a non-degenerate subset of problem instances exhibits model deterioration, whereby the system converges to an outcome that is strictly worse than the initial state.
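The first failure mode can be illustrated with a much simpler setting than the one the paper studies. The sketch below is not the authors' model: it assumes a generic two-armed Bernoulli bandit and a purely greedy learner with no exploration, and the names `greedy_final_arm` and `stuck_fraction` are hypothetical. It only shows how greedy updating can lock onto a suboptimal option with positive probability.

```python
import random


def greedy_final_arm(means, horizon, rng):
    """Run a purely greedy learner on a Bernoulli bandit; return its final choice.

    Illustrative assumption: `means` holds each arm's success probability.
    """
    n_arms = len(means)
    pulls = [0] * n_arms
    wins = [0] * n_arms

    def pull(arm):
        # Draw one Bernoulli reward for the chosen arm and record it.
        pulls[arm] += 1
        if rng.random() < means[arm]:
            wins[arm] += 1

    def greedy_choice():
        # Arm with the highest empirical mean; ties go to the lowest index.
        return max(range(n_arms), key=lambda a: wins[a] / pulls[a])

    # Sample each arm once so that every empirical mean is defined.
    for arm in range(n_arms):
        pull(arm)

    # From here on, never explore: always pull the current empirical best.
    for _ in range(horizon - n_arms):
        pull(greedy_choice())
    return greedy_choice()


def stuck_fraction(means, horizon, runs, seed=0):
    """Fraction of independent runs that end on an arm other than the true best."""
    rng = random.Random(seed)
    best = max(range(len(means)), key=lambda a: means[a])
    stuck = sum(greedy_final_arm(means, horizon, rng) != best for _ in range(runs))
    return stuck / runs


if __name__ == "__main__":
    print(stuck_fraction((0.5, 0.7), horizon=200, runs=2000))
```

In this toy setting, any run in which the better arm's single initial sample is a failure never returns to that arm, because its empirical mean stays at zero and the tie-break favors the other arm. With success probabilities 0.5 and 0.7 this happens with probability 0.3, so the greedy learner stabilizes on the worse arm in a non-negligible share of runs.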

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Paper Type: Standard paper

Submission Number: 70
