Keywords: World Model, Data Poisoning, Optimization
Abstract: Model-based learning agents that use a world model to predict and plan have shown impressive success in solving diverse, complex tasks and adapting to new environments. However, the process of exploring open environments and updating the model with collected experience also exposes them to adversarial manipulation. In this paper, we propose SWAAP, the first scalable and stealthy data poisoning method for world models, designed to benchmark their adversarial robustness. SWAAP uses a novel two-stage approach. In the first stage, the attacker identifies a target world model that deviates only slightly from the true environment but significantly degrades the agent's performance when used for planning. This is achieved via first-order bilevel optimization and a new transition gradient theorem. In the second stage, the attacker performs the actual attack by perturbing a small subset of the fine-tuning data to steer the fine-tuned world model toward the target model. Evaluations on diverse tasks show that our approach induces a substantial performance drop and remains effective even under robust training and detection, underscoring the urgent need for stronger protection in world modeling.
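To make the two-stage structure concrete, the following PyTorch sketch illustrates one plausible instantiation. It is an assumption-laden illustration, not the paper's implementation: `planner_return`, `model_loss`, the perturbation budgets, and the gradient-matching heuristic in stage 2 are hypothetical stand-ins for the paper's bilevel optimization and transition gradient theorem.

```python
import torch

# Stage 1: search for a target world model w* that stays within an
# epsilon-ball of the clean model w0 yet minimizes the (surrogate) return
# the agent obtains when planning with it. `planner_return` is a
# hypothetical differentiable stand-in for the paper's bilevel objective.
def stage1_target_model(w0, planner_return, eps=0.1, steps=200, lr=1e-2):
    delta = torch.zeros_like(w0, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        planner_return(w0 + delta).backward()  # minimize the agent's return
        opt.step()
        with torch.no_grad():                  # stay close to the true model
            delta.clamp_(-eps, eps)
    return (w0 + delta).detach()

# Stage 2: perturb a small fraction of the fine-tuning transitions so that
# w* becomes (approximately) a stationary point of the fine-tuning loss on
# the poisoned dataset. This gradient-matching heuristic is a common
# poisoning strategy, not necessarily the paper's steering procedure.
# `model_loss` is a hypothetical fine-tuning loss over (model, data).
def stage2_poison(w_star, clean_data, model_loss,
                  frac=0.05, steps=200, lr=1e-2, budget=0.1):
    n = max(1, int(frac * len(clean_data)))
    poison, rest = clean_data[:n], clean_data[n:]
    delta = torch.zeros_like(poison, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    w = w_star.detach().requires_grad_(True)
    for _ in range(steps):
        opt.zero_grad()
        data = torch.cat([poison + delta, rest], dim=0)
        # Push the fine-tuning gradient at w* toward zero.
        g = torch.autograd.grad(model_loss(w, data), w, create_graph=True)[0]
        g.pow(2).sum().backward()
        opt.step()
        with torch.no_grad():                  # keep perturbations stealthy
            delta.clamp_(-budget, budget)
    return torch.cat([poison + delta.detach(), rest], dim=0)

# Toy usage with stand-in objectives (all hypothetical):
w0 = torch.randn(8)
w_star = stage1_target_model(w0, lambda w: (w ** 2).sum())
poisoned = stage2_poison(w_star, torch.randn(100, 8),
                         lambda w, d: (d - w).pow(2).mean())
```

In the actual method, the paper's first-order bilevel optimization and transition gradient theorem would replace the stand-in objectives used above.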
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14438