Learning Robust Penetration Testing Policies under Partial Observability: A Systematic Evaluation

Published: 06 May 2026, Last Modified: 06 May 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Penetration testing, the simulation of cyberattacks to identify security vulnerabilities, presents a sequential decision-making problem well suited to reinforcement learning (RL) automation. As in many real-world applications of RL, partial observability poses a major challenge, since it breaks the Markov property assumed by Markov Decision Processes (MDPs). Partially Observable MDPs (POMDPs) instead require history aggregation or belief-state estimation to learn successful policies. We investigate stochastic, partially observable penetration testing scenarios over host networks of varying size, aiming to better reflect real-world complexity through more challenging and representative benchmarks. This approach leads to more robust and transferable policies, which are crucial for reliable performance across diverse and unpredictable real-world environments. Using vanilla Proximal Policy Optimization (PPO) as a baseline, we compare a selection of PPO-based variants designed to mitigate partial observability, including frame-stacking, augmenting observations with historical information, and employing LSTM or Transformer-XL (TrXL) architectures. We conduct a systematic empirical analysis of these algorithms across different host network sizes. We find that this task benefits greatly from history aggregation, which converges up to four times faster than the other approaches. Manual inspection of the policies learned by the algorithms reveals clear distinctions and provides insights that go beyond the quantitative results.
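To make the simplest history-aggregation baseline mentioned above concrete, below is a minimal sketch of a frame-stacking wrapper, assuming a Gymnasium-style environment interface with flat vector observations. The class name, the stack size `k`, and the concatenation scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np
from collections import deque


class FrameStackWrapper:
    """Stack the last k observations so a memoryless policy (e.g. vanilla
    PPO) sees a short history window instead of a single partial view.

    Hypothetical sketch: assumes env.reset() -> (obs, info) and
    env.step(a) -> (obs, reward, terminated, truncated, info), with
    observations given as flat numpy arrays.
    """

    def __init__(self, env, k=4):
        self.env = env
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        # Pre-fill the buffer with the first observation so the stacked
        # shape is fixed (k * obs_dim,) from the very first step.
        for _ in range(self.k):
            self.frames.append(obs)
        return self._stacked(), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.frames.append(obs)
        return self._stacked(), reward, terminated, truncated, info

    def _stacked(self):
        # Concatenate along the feature axis; the policy network then
        # consumes the fixed-size history window as a single input vector.
        return np.concatenate(list(self.frames), axis=-1)
```

With such a wrapper, the stacked vector can be fed to an unmodified PPO policy network, in contrast to the LSTM and TrXL variants, which maintain memory inside the network itself.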
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission:

Section 2 (Background)
- Reordered to present penetration testing before mathematical frameworks
- Added explicit mapping between pentesting concepts and RL components at end of Section 2.1
- Improved motivation for why MDPs/POMDPs/RL are relevant to the domain

Section 3 (Related Work)
- Added explicit definitions of offensive vs. defensive agents and lateral movement vs. full penetration testing
- Removed speculative language (e.g., "justifiable") about prior work choices
- Reordered logical flow: gap identified → recent work → limitations → our contribution

Section 4.1 (Methodology)
- Added comparison table contrasting NASim's original capabilities with StochNASim's extensions

Section 4.1.5 (Observations)
- Added citation to penetration testing tools (Nmap) justifying transient observations
- Clarified encoding scheme: unscanned/absent attributes both encoded as zeros

Section 5.1 (Setup)
- Clarified that parameter choices are adapted from the original NASim benchmark for consistency with prior work

Section 5.4 (Stochasticity Analysis)
- Added explanation of how partial observability enables memorization with fixed scenarios
- Added quantitative train-test generalization metrics comparing NASim vs. StochNASim

Scope and Claims Refinements (Abstract, Introduction, Discussion, and Conclusion)
- Tempered claims about TrXL performance to specify "within our setting and hyperparameter regime"
- Added explicit caveats that findings are validated for PPO-style on-policy methods
- Acknowledged need for validation with off-policy methods and additional baselines
- Standardized speedup claims for consistency throughout the manuscript

Discussion Section
- Reframed contribution as identifying a well-characterized problem class rather than claiming universal superiority
- Positioned larger-scale evaluation and studies on dynamic scenarios as concrete future work
- Noted that dynamic scenarios would likely favor learned memory approaches
Code: https://github.com/raphsimon/StochNASim
Supplementary Material: zip
Assigned Action Editor: ~George_Trimponias2
Submission Number: 6427