Keywords: Memorisation, World Models, Model-Based Reinforcement Learning, Membership Inference, Privacy
TL;DR: We audit memorisation in MBRL world models. Reconstruction-based MIA reaches AUC=0.999 on IRIS/Ms. Pac-Man where loss-based MIA fails, which is consistent with leakage concentrating in the decoder rather than in the likelihood surface.
Abstract: Model-based reinforcement learning (MBRL) agents such as
DreamerV3 and IRIS train a \emph{world model} on replay-buffer trajectories and then
optimise their policies inside their ``imagination.'' We present the
first systematic membership-inference audit of MBRL world models,
adapting three attack families (trajectory reconstruction, dynamics-loss MIA, and
adversarial-action divergence) to the action-conditioned generative
setting. We test for leakage across DreamerV3 and IRIS on four Atari
games. On the strongest configuration---IRIS / Ms.\ Pac-Man---reconstruction
attains AUC$=0.999$ with Cohen's $d=-4.76$ at horizon $H{=}30$, and
TPR$=0.98$ at $1\%$ FPR, exceeding signals typically reported for
language and diffusion models; on DreamerV3 / Krull, reconstruction
(AUC$=0.682$) and adversarial divergence ($p<10^{-10}$) independently
corroborate membership. Nonetheless, the attack families can disagree
sharply: on the IRIS / Ms.\ Pac-Man checkpoint that yields
near-perfect reconstruction, loss-MIA flags zero members at the same
$1\%$-FPR threshold, and five of the eight loss-MIA evaluations score
below random. We attribute this disagreement to a state-space
mismatch between members and non-members induced by the data-collection
policy, which swamps likelihood-based scores while leaving
pixel-level signals intact. The implication is that memorisation in
pixel-generative world models concentrates in the decoder
pathway---the inverse of the language-model setting in which
loss-based MIA is the standard tool.
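
The three statistics quoted above (AUC, TPR at a fixed FPR, and Cohen's $d$) can be computed from per-trajectory attack scores roughly as follows. This is a minimal sketch, not the authors' evaluation code: it assumes the attack score is a reconstruction error that is lower for members (which would match the negative sign of the reported $d$), and the function names and the synthetic data are hypothetical.

```python
import numpy as np


def cohens_d(member_err, nonmember_err):
    """Standardised mean difference; negative when members have lower error."""
    m = np.asarray(member_err, dtype=float)
    n = np.asarray(nonmember_err, dtype=float)
    pooled_var = ((m.size - 1) * m.var(ddof=1) + (n.size - 1) * n.var(ddof=1)) / (
        m.size + n.size - 2
    )
    return float((m.mean() - n.mean()) / np.sqrt(pooled_var))


def auc(member_err, nonmember_err):
    """P(member error < non-member error), counting ties as one half.

    Uses all member/non-member pairs, so memory grows as len(m) * len(n).
    """
    m = np.asarray(member_err, dtype=float)[:, None]
    n = np.asarray(nonmember_err, dtype=float)[None, :]
    return float((m < n).mean() + 0.5 * (m == n).mean())


def tpr_at_fpr(member_err, nonmember_err, fpr=0.01):
    """Fraction of members flagged when at most a share `fpr` of non-members is."""
    m = np.asarray(member_err, dtype=float)
    n = np.sort(np.asarray(nonmember_err, dtype=float))
    k = int(np.floor(fpr * n.size))  # non-members we may flag by mistake
    threshold = n[k]  # flag "member" if error < threshold; at most k non-members pass
    return float((m < threshold).mean())


if __name__ == "__main__":
    # Synthetic errors for illustration only; these are not the paper's data.
    rng = np.random.default_rng(0)
    members = rng.normal(loc=0.2, scale=0.05, size=500)
    non_members = rng.normal(loc=0.5, scale=0.05, size=500)
    print("AUC:", auc(members, non_members))
    print("TPR at 1% FPR:", tpr_at_fpr(members, non_members))
    print("Cohen's d:", cohens_d(members, non_members))
```

Under this convention an attack that "flags zero members" at the $1\%$-FPR threshold corresponds to `tpr_at_fpr` returning 0, and an evaluation that scores "below random" corresponds to `auc` returning less than 0.5.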
Submission Number: 171