Privacy Preserving Reinforcement Learning for Population Processes

TMLR Paper 2924 Authors

25 Jun 2024 (modified: 04 Nov 2024) · Decision pending for TMLR · License: CC BY 4.0
Abstract: We consider the problem of privacy protection in Reinforcement Learning (RL) algorithms that operate over population processes, a practical but understudied setting that includes, for example, the control of epidemics in large populations of dynamically interacting individuals. In this setting, the RL algorithm interacts with the population over $T$ time steps by receiving population-level statistics as the state and performing actions that can affect the entire population at each time step. An individual's data can be collected across multiple interactions, and their privacy must be protected at all times. We clarify the Bayesian semantics of Differential Privacy (DP) in the presence of correlated data in population processes through a Pufferfish Privacy analysis. We then give a meta algorithm that can take any RL algorithm as input and make it differentially private. This is achieved by using DP mechanisms to privatize the state and reward signal at each time step before the RL algorithm receives them as input. Our main theoretical result shows that the value-function approximation error incurred when applying standard RL algorithms directly to the privatized states shrinks quickly as the population size and privacy budget increase. This highlights that reasonable privacy-utility trade-offs are possible for differentially private RL algorithms in population processes. Our theoretical findings are validated by experiments performed on a simulated epidemic control problem over large population sizes.
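To make the privatization step described in the abstract concrete, the sketch below shows one way a wrapper could add DP noise to population-level statistics and rewards before an off-the-shelf RL algorithm sees them. This is a minimal illustration only: the class name `PrivateObservationWrapper`, the use of the Laplace mechanism, and the sensitivity handling are assumptions for exposition, not the paper's actual meta algorithm or mechanisms.

```python
import numpy as np

class PrivateObservationWrapper:
    """Illustrative sketch (not the paper's algorithm): privatize the
    population-level state and the reward with Laplace noise before
    handing them to any RL algorithm."""

    def __init__(self, env, epsilon, sensitivity=1.0, rng=None):
        self.env = env                      # underlying population-process environment (assumed interface)
        self.scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon
        self.rng = rng or np.random.default_rng()

    def _privatize(self, x):
        # Add i.i.d. Laplace noise to each population-level statistic.
        return x + self.rng.laplace(loc=0.0, scale=self.scale, size=np.shape(x))

    def reset(self):
        state = self.env.reset()
        return self._privatize(np.asarray(state, dtype=float))

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        return (self._privatize(np.asarray(state, dtype=float)),
                float(self._privatize(float(reward))),
                done, info)

# Usage (hypothetical names): the RL algorithm interacts only with the
# wrapped environment, so it never receives raw individual-level data.
# private_env = PrivateObservationWrapper(epidemic_env, epsilon=1.0)
# agent.train(private_env)
```

The design intent matches the abstract's description: privacy is enforced at the interface between the environment (or data curator) and the learner, so any standard RL algorithm can be used unchanged on the privatized signals.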
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: The revised manuscript contains the following changes:
- New experimental results on a recent graph dataset have been included in the plots of Figures 3 and 4. The writing has been updated accordingly.
- New experimental results: the experiments now include two additional transition parameter settings. Figure 3 contains graphs plotting the private reward performance of the RL algorithm as the transition parameters, population size, and privacy parameters vary. New results on the true reward performance of the RL algorithm are shown in Figure 4. Note that the number of time steps was reduced from $T = 5 \times 10^5$ to $T = 2 \times 10^5$ in all experiments to ensure the experiments would complete in time.
- Updated writing in Sections 5 and 6 to reflect the new results. Section 5 now also emphasizes that the epidemic control problem is representative of population-process environments.
- Updated writing in Section 3.1 to correct for the fact that the data curator, who is part of the environment, computes the reward, not the agent.
- The plot showing target privacy vs. achieved privacy has been moved to the appendix.
Assigned Action Editor: ~Naman_Agarwal1
Submission Number: 2924