Reducing information dependency does not cause training data privacy. Adversarially non-robust features do.

Published: 26 Jan 2026 · Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Privacy, model inversion attacks, extraction attacks, adversarial examples, memorization, training data, causal inference, causality
TL;DR: We challenge the prevailing view that reducing information dependency (including rote memorization) causes training data privacy under model inversion attacks (MIAs), and we show that instead, adversarially non-robust features do.
Abstract: In this paper, we challenge the prevailing view that information dependency (including rote memorization) drives training data exposure to image reconstruction attacks. We show that extensive exposure can persist without rote memorization and is instead caused by a tunable connection to adversarial robustness. We begin by presenting three surprising results: (1) recent defenses that inhibit reconstruction by Model Inversion Attacks (MIAs), which evaluate leakage under an idealized attacker, do not reduce standard measures of information dependency (HSIC); (2) models that maximally memorize their training datasets remain robust to MIA reconstruction; and (3) models trained without seeing 97% of the training pixels, where recent information-theoretic bounds give arbitrarily strong privacy guarantees under standard assumptions, can still be devastatingly reconstructed by MIA. To explain these findings, we provide causal evidence that privacy under MIA arises from what the adversarial examples literature calls "non-robust" features (generalizable but imperceptible and unstable features). We further show that recent MIA defenses obtain their privacy improvements by unintentionally shifting models toward such features. To establish this causal relationship, we introduce **A**n**t**i **A**dversarial **T**raining (**AT-AT**), a training regime that intentionally learns non-robust features to obtain both superior reconstruction defense and higher accuracy than state-of-the-art defenses. Our results revise the prevailing understanding of training data exposure and reveal a new privacy-robustness tradeoff.
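To make the first result concrete: the "standard measure of information dependency" the abstract names is HSIC (the Hilbert-Schmidt Independence Criterion). Below is a minimal, hedged sketch of the common biased empirical HSIC estimator with RBF kernels; the kernel choice, bandwidth, and toy data are illustrative assumptions, not the paper's actual experimental setup.

```python
# Illustrative sketch of a biased empirical HSIC estimator.
# RBF kernel and sigma=1.0 are assumptions for demonstration only.
import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Pairwise squared Euclidean distances -> RBF Gram matrix.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Biased estimator: HSIC = (1/n^2) * trace(K H L H),
    # where H = I - (1/n) 11^T centers the Gram matrices.
    n = X.shape[0]
    K, L = rbf_kernel(X, sigma), rbf_kernel(Y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y_dep = X + 0.1 * rng.normal(size=(200, 5))   # strongly dependent on X
Y_ind = rng.normal(size=(200, 5))             # independent of X
# Dependence yields a markedly larger HSIC value than independence.
print(hsic(X, Y_dep) > hsic(X, Y_ind))
```

A dependency measure like this taking a high value between training inputs and model representations is what result (1) says the recent MIA defenses fail to reduce, even as reconstruction attacks are inhibited.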
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5445