Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

Radman Rakhshandehroo; Daniel Coombs

Published: 24 Mar 2026, Last Modified: 24 Mar 2026. Accepted by TMLR. License: CC BY 4.0

Abstract: We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we find that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type, where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts, and our results highlight the importance of reward design, information structure, and environmental predictability in learning. Our code is publicly available at https://github.com/redradman/ContagionRL
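To make the idea of a directional, potential-field-style reward term concrete, here is a minimal Python sketch. It is not the formula used in ContagionRL: the inverse-square repulsion, the cosine-alignment score, the flat (non-wrapping) 2D geometry, and the function names `potential_field_direction` and `directional_reward` are all assumptions chosen for illustration.

```python
import math


def potential_field_direction(agent_xy, infected_xy, eps=1e-9):
    """Unit vector pointing away from infected agents.

    Assumption: each infected agent repels with strength 1/d^2.
    Returns (0.0, 0.0) when the repulsive forces cancel or none exist.
    """
    ax, ay = agent_xy
    fx = fy = 0.0
    for ix, iy in infected_xy:
        dx, dy = ax - ix, ay - iy
        d2 = dx * dx + dy * dy
        if d2 < eps:
            continue  # coincident positions give no defined direction
        d = math.sqrt(d2)
        # Unit vector (dx/d, dy/d) scaled by the assumed 1/d^2 strength.
        fx += dx / (d * d2)
        fy += dy / (d * d2)
    norm = math.hypot(fx, fy)
    if norm < eps:
        return (0.0, 0.0)
    return (fx / norm, fy / norm)


def directional_reward(move_xy, agent_xy, infected_xy, eps=1e-9):
    """Cosine alignment in [-1, 1] between the move and the repulsive direction."""
    mx, my = move_xy
    m = math.hypot(mx, my)
    if m < eps:
        return 0.0  # standing still earns no directional reward in this sketch
    ux, uy = potential_field_direction(agent_xy, infected_xy, eps)
    return (mx * ux + my * uy) / m
```

In this sketch, a move directly away from a single infected neighbor scores 1, a move directly toward it scores -1, and a perpendicular move scores 0. A real reward would likely combine such a term with health and adherence components, as the abstract describes.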

Submission Type: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: We thank the Action Editor for the constructive feedback. Below we summarize the changes made.

> "It is not clear how different reward terms are computed. For example, for the directional term, does it depend on the local information, or is some global information used? Some discussions would help the reader."

We added three clarifications: (a) a paragraph in Section 3.3 characterizing the information scope of each reward component, distinguishing local terms (health, adherence) from global terms (direction, magnitude), and discussing the resulting information asymmetry under POMDP conditions; (b) a new Appendix C subsection with a table covering all five reward functions, categorizing each component as Local or Global; and (c) an "Information Scope" paragraph in Appendix C.3 clarifying that the potential field force summation iterates over all humans regardless of visibility radius, with a note in the POMDP results section confirming that improved performance under limited visibility reflects genuine policy robustness.

> "While the single-agent setting is fine, as many reviewers have pointed out, the impact on the multi-agent interaction system would be important. Hence, few metrics such as total infections caused/avoided on the population-level would make the paper more complete."

We added a population-level analysis in the Results section (Table 2). We evaluate all 15 trained models and two baselines over 90 episodes each, reporting mean new infections per timestep to control for survival length. Trained policies reduce population infection rates by 10 to 21 percent relative to baselines, with the Potential Field reward achieving the largest reduction (p < 0.01, Bonferroni-corrected).

We are grateful to the reviewers and Action Editor for the thorough review process and the feedback that improved this work.
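The population-level metric described above can be sketched in a few lines of Python. This is an illustrative reading, not the authors' evaluation code: averaging per-episode rates (rather than pooling counts across episodes), the helper names, and the example numbers in the tests are assumptions. The Bonferroni step is the standard one: multiply each raw p-value by the number of comparisons and cap at 1.

```python
def infections_per_timestep(episodes):
    """Mean of per-episode infection rates.

    episodes: list of (new_infections, timesteps) tuples.
    Dividing by episode length controls for how long the agent survived.
    Assumption: rates are averaged per episode, not pooled.
    """
    rates = [n / t for n, t in episodes if t > 0]
    if not rates:
        raise ValueError("need at least one episode with timesteps > 0")
    return sum(rates) / len(rates)


def relative_reduction(policy_rate, baseline_rate):
    """Fractional reduction of the policy's rate relative to a baseline."""
    return (baseline_rate - policy_rate) / baseline_rate


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

A comparison would then be called significant at the 0.01 level only if its adjusted p-value falls below 0.01.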

Video: https://youtu.be/xrWsZDwvxOw

Code: https://github.com/redradman/ContagionRL

Supplementary Material: zip

Assigned Action Editor: ~Arnob_Ghosh3

Submission Number: 6605
