The Hawk Effect: Why We Need a Two-Dimensional Measure of Machine Intelligence
Keywords: Definitions of Intelligence, AI Evaluation, AI Alignment, Epistemic Intelligence, Reinforcement Learning
TL;DR: We explore the "Hawk Effect": the possibility of training adversarial policies against powerful self-play models. We argue that one consequence is that any definition of "intelligence" needs at least two dimensions.
Abstract: Deep reinforcement-learning agents achieve superhuman performance in
combinatorially complex domains, yet can collapse when subjected to
specialized, dynamically trained adversarial policies---a phenomenon
we term the \emph{Hawk Effect}. We argue that this vulnerability is
not merely an empirical anomaly but a \emph{structural risk}
associated with finite-capacity deployed systems under adaptive
attack, and that it forces a philosophical question: what should
``intelligence'' mean for an ML system whose cheap-to-train
adversaries can route it from triumph to catastrophe in a few moves?
We propose treating deployed machine intelligence as at least
two-dimensional, with one axis capturing domain-normalized
standard-task competence (Hutter-inspired) and a second axis
capturing adaptive adversarial work-to-failure (Pearl-inspired). We
introduce the \emph{Adversarial Work Criterion} (AWC) as a concrete
first instantiation of the second axis. The paper is intentionally
argumentative; we devote a section to alternative views.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 16