Gremlin: AI-Based Adversarial Stress-Testing for Autonomous Space Systems

Marie Ethvignot; Masahiro Ono

Published: 28 Apr 2026, Last Modified: 15 May 2026 | IEEE ICRA 2026 Workshop SRW | CC BY 4.0

Keywords: space robotics, autonomy, verification and validation, adversarial testing, Monte Carlo Tree Search, Gaussian processes, planetary exploration

TL;DR: We introduce Gremlin, an adversarial AI agent that stress-tests autonomous spacecraft by intelligently searching for realistic failure-inducing disturbance sequences for simulation-based V&V.

Abstract: As robotic space missions push into increasingly remote and poorly characterized environments, onboard autonomy is becoming essential. Yet the verification and validation (V&V) methods used to certify autonomous systems before launch were designed for largely pre-programmed spacecraft and do not scale to systems whose behavior evolves in response to unknown conditions. This work introduces Gremlin, an adversarial framework for simulation-based V&V of autonomous spacecraft. Acting as an intelligent adversary within a mission simulator, Gremlin injects structured disturbances during execution and uses Monte Carlo Tree Search to direct exploration toward scenarios that degrade mission performance. Disturbances are modeled as continuous, correlated functions drawn from a Gaussian process prior $g(\tau) \sim \mathcal{GP}(0, \sigma^2 k(\tau, \tau'))$, and a global Mahalanobis budget constraint $d_{0:T}^\top K^{-1} d_{0:T} \leq C$ ensures that identified scenarios remain statistically plausible under the assumed uncertainty model. This is a deliberate design choice: in physical systems, the most operationally relevant failure scenarios are not the most extreme ones but the plausible ones that stress the system in unexpected ways. Gremlin therefore searches for scenarios at the boundary of credible behavior rather than pushing toward statistically unlikely extremes. Evaluated on a Uranus atmospheric entry and relay communication scenario, Gremlin exposes a failure mode in which the relay window simultaneously shifts and compresses due to coupled aerodynamic and pressure-based termination effects, resulting in a 19.7% reduction in science data return under a statistically plausible atmospheric disturbance. Because this failure mode requires a correlated multi-event disturbance, independent parameter sweeps and standard sensitivity analysis are unlikely to uncover it.
While demonstrated on atmospheric entry, Gremlin applies to any autonomous system evaluable in simulation under uncertainty, including rover traverse planning, autonomous rendezvous, or onboard science scheduling. As autonomy becomes more capable and adaptive, the tools used to certify it must become more intelligent too.
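
As an illustration of the plausibility constraint described in the abstract, the following minimal sketch samples a correlated disturbance from a zero-mean Gaussian process and rescales it so that $d_{0:T}^\top K^{-1} d_{0:T} \leq C$. It is not the authors' implementation. The squared-exponential kernel, the length scale, the jitter term, the budget value, and the radial rescaling step are all assumptions made for this example, as are the function names `rbf_kernel`, `mahalanobis_sq`, and `project_to_budget`. Here $K$ is assumed to be the GP covariance matrix (including $\sigma^2$) evaluated on a discrete time grid.

```python
import numpy as np


def rbf_kernel(tau, sigma=1.0, length_scale=0.2):
    """Covariance sigma^2 * k(tau, tau') with a squared-exponential k (assumed kernel)."""
    diff = tau[:, None] - tau[None, :]
    return sigma**2 * np.exp(-0.5 * (diff / length_scale) ** 2)


def mahalanobis_sq(d, chol):
    """Return d^T K^{-1} d, where chol is the lower Cholesky factor of K."""
    w = np.linalg.solve(chol, d)  # w = L^{-1} d, so w.w = d^T K^{-1} d
    return float(w @ w)


def project_to_budget(d, chol, budget):
    """Shrink d radially so that d^T K^{-1} d <= budget (assumed enforcement rule)."""
    m = mahalanobis_sq(d, chol)
    if m <= budget:
        return d
    return d * np.sqrt(budget / m)


rng = np.random.default_rng(0)
tau = np.linspace(0.0, 1.0, 20)                # hypothetical normalized time grid
K = rbf_kernel(tau) + 1e-6 * np.eye(tau.size)  # jitter keeps K positive definite
L = np.linalg.cholesky(K)

# Draw a GP sample and inflate it so that it violates the budget on purpose.
d_raw = 3.0 * (L @ rng.standard_normal(tau.size))
C = 10.0                                       # hypothetical plausibility budget
d = project_to_budget(d_raw, L, C)

print("budget used before:", mahalanobis_sq(d_raw, L))
print("budget used after: ", mahalanobis_sq(d, L))
```

Radial rescaling preserves the shape of the disturbance and changes only its amplitude, which is one simple way to keep a candidate on or inside the plausibility boundary. A search procedure such as Monte Carlo Tree Search could instead spend the budget incrementally across time steps; the abstract does not say which approach Gremlin uses.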

Submission Number: 30
