Attractor Inversion: A Geometric Account of Adversarial Manipulation in Human Decision-Making

Leo Lorence George; Anushri Iyer; Abhishek Bakshi; Pavan Kulkarni

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, Venue: AI4GOOD Workshop 2026 Regular, Readers: Everyone, License: CC BY 4.0

Keywords: adversarial manipulation, human decision-making, recurrent neural networks, phase portrait, attractor dynamics, interpretability, AI safety, reward engineering, auditing, risk stratification

TL;DR: Adversarial reward engineering reshapes the attractor landscape of human decision dynamics, not individual trials; baseline attractor geometry predicts susceptibility before manipulation begins, enabling a deployable auditing framework.

Abstract: Billions of people interact daily with systems that control reward delivery, yet no practical method exists to detect whether those reward schedules are being used to covertly steer user behavior. We close this gap by providing the first geometric, mechanistic account of how adversarial reward engineering works, and the first deployable auditing framework to detect it. Replacing opaque GRU surrogates with interpretable TinyRNNs ($d=4$ hidden units, selected unanimously across all 25 cross-validation folds) and applying phase portrait analysis, we show that adversarial reinforcement learning agents do not manipulate behavior trial-by-trial; instead, they reshape the entire attractor landscape of human decision dynamics. Across two tasks (2-arm bandit and Go/No-Go), the no-reward fixed point inverts from $L^*=-0.24$ to $+1.11$ (permutation $p<0.001$); in Go/No-Go, the no-go attractor sign-inverts from $-2.81$ to $+1.32$ ($p=0.013$) and the go attractor fragments into multiple unstable fixed points ($p=0.007$). Critically, this threat is individually predictable before it begins: baseline attractor geometry predicts susceptibility ($r=-0.60$, $p<0.001$, slope $=-0.86$ logits/logit), and resistant subjects (36%) are geometrically near-indifferent at baseline. These findings yield a concrete auditing protocol: fit a TinyRNN to behavioral logs, extract the arm-0, $R=0$ (no-reward) fixed point per user, and flag drift outside the natural reference distribution as evidence of adversarial reward engineering. Because baseline geometry predicts susceptibility, users can also be risk-stratified before any manipulation begins.
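The last two steps of the auditing protocol (extract a fixed point, flag drift) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the fitted TinyRNN is replaced by a hypothetical one-dimensional logit update map `update_fn` for a single input condition, the helper names `find_fixed_points` and `flag_drift` are invented here, the reference population is synthetic, and "outside the natural reference distribution" is read as falling outside a central percentile band. The values $-0.24$ and $+1.11$ are borrowed from the abstract only as toy targets.

```python
import numpy as np

def find_fixed_points(update_fn, lo=-5.0, hi=5.0, n_grid=2001, tol=1e-10):
    """Return (L_star, is_stable) pairs with update_fn(L_star) == L_star on [lo, hi].

    update_fn is a hypothetical stand-in for one trial of a fitted model's
    logit update under a fixed input condition (e.g. arm 0 chosen, R = 0).
    """
    grid = np.linspace(lo, hi, n_grid)
    delta = np.array([update_fn(L) - L for L in grid])  # logit change per trial
    fixed_points = []
    for i in range(n_grid - 1):
        if delta[i] == 0.0:
            root = grid[i]
        elif delta[i] * delta[i + 1] < 0:
            # Sign change in the logit change: refine the root by bisection.
            a, b, fa = grid[i], grid[i + 1], delta[i]
            while b - a > tol:
                m = 0.5 * (a + b)
                fm = update_fn(m) - m
                if fa * fm <= 0:
                    b = m
                else:
                    a, fa = m, fm
            root = 0.5 * (a + b)
        else:
            continue
        # A fixed point of a 1-D discrete map is stable when |d update / dL| < 1.
        eps = 1e-5
        slope = (update_fn(root + eps) - update_fn(root - eps)) / (2 * eps)
        fixed_points.append((float(root), bool(abs(slope) < 1.0)))
    return fixed_points

def flag_drift(user_fixed_point, reference_fixed_points, lower_q=2.5, upper_q=97.5):
    """Flag a fixed point outside the central percentile band of the reference set."""
    lo_ref, hi_ref = np.percentile(reference_fixed_points, [lower_q, upper_q])
    return bool(user_fixed_point < lo_ref or user_fixed_point > hi_ref)

# Synthetic reference population of baseline fixed points (assumed, not real data).
rng = np.random.default_rng(0)
reference = rng.normal(loc=-0.24, scale=0.3, size=500)

# Toy linear update maps that relax toward a target logit (assumed dynamics).
baseline_map = lambda L: L - 0.5 * (L - (-0.24))
shifted_map = lambda L: L - 0.5 * (L - 1.11)

fp_base = find_fixed_points(baseline_map)[0][0]
fp_shift = find_fixed_points(shifted_map)[0][0]
print(flag_drift(fp_base, reference), flag_drift(fp_shift, reference))
```

In a real audit, `update_fn` would be obtained by iterating the fitted recurrent model under a fixed input, and the percentile band would be one of several possible choices for the reference criterion; the sketch only shows how the fixed-point and drift steps fit together.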

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 72
