Speeding up Policy Simulation in Supply Chain RL

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: Simulating a single trajectory of a dynamical system under some state-dependent policy is a core bottleneck in policy optimization (PO) algorithms. The many inherently serial policy evaluations that must be performed in a single simulation constitute the bulk of this bottleneck. In applying PO to supply chain optimization (SCO) problems, simulating a single sample path corresponding to one month of a supply chain can take several hours. We present an iterative algorithm to accelerate policy simulation, dubbed Picard Iteration. This scheme carefully assigns policy evaluation tasks to independent processes. Within an iteration, any given process evaluates the policy only on its assigned tasks while assuming a certain ‘cached’ evaluation for other tasks; the cache is updated at the end of the iteration. Implemented on GPUs, this scheme admits batched evaluation of the policy across a single trajectory. We prove that the structure afforded by many SCO problems allows convergence in a small number of iterations independent of the horizon. We demonstrate practical speedups of 400x on large-scale SCO problems even with a single GPU, and also demonstrate practical efficacy in other RL environments.
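To make the scheme concrete, here is a minimal pure-Python sketch of Picard-style trajectory simulation. The names (`policy`, `transition`, `simulate_sequential`, `simulate_picard`) are hypothetical and this is not the authors' implementation; in particular, the practical speedup comes from batching the per-sweep policy evaluations on a GPU, which this scalar sketch only indicates in comments.

```python
# Minimal sketch of Picard iteration for policy simulation.
# Hypothetical names; scalar states/actions for simplicity.

def simulate_sequential(policy, transition, s0, T):
    """Baseline rollout: T inherently serial policy evaluations."""
    states, actions = [s0], []
    for t in range(T):
        actions.append(policy(states[-1]))                 # the serial bottleneck
        states.append(transition(states[-1], actions[-1]))
    return states, actions

def simulate_picard(policy, transition, s0, T, max_iters=None):
    """Fixed-point refinement of a cached action trajectory.

    Each sweep (i) rolls the states forward under the *cached* actions
    (transitions are assumed cheap relative to policy evaluation), then
    (ii) re-evaluates the policy at every step: T independent calls that
    can be batched, e.g. on a GPU. The cache is updated at the end of
    the sweep. Since the correct prefix of the trajectory grows by at
    least one step per sweep, T sweeps always suffice; the paper's claim
    is that SCO structure yields convergence in far fewer.
    """
    if max_iters is None:
        max_iters = T                      # worst case: exact in T sweeps
    cached_actions = [policy(s0)] * T      # rough initial guess
    for _ in range(max_iters):
        states = [s0]
        for t in range(T):
            states.append(transition(states[t], cached_actions[t]))
        new_actions = [policy(s) for s in states[:T]]      # batchable across t
        if new_actions == cached_actions:
            break                          # fixed point = the true trajectory
        cached_actions = new_actions
    return states, cached_actions
```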
Lay Summary: In the process of designing automated decision-making algorithms, a key bottleneck is simply simulating how well a candidate algorithm performs. In complex systems like supply chains, these simulations can be painfully slow, sometimes taking hours just to simulate one month. We present a new method called Picard Iteration that dramatically speeds up this process. Traditional simulations move step-by-step through time, which makes them hard to parallelize. Picard Iteration takes a different approach: it begins with a rough guess for the entire timeline of the simulation and then repeatedly improves that guess. Crucially, each round of refinement can be done in parallel (i.e., simultaneously) across time steps. This structure is a natural fit for hardware like GPUs, which are designed to massively speed up parallel computations. Using this method, we achieve up to a 400-fold speedup for supply chain simulations, even on a single GPU, and see similar gains in other decision-making tasks.
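Continuing the sketch above, a toy order-up-to inventory example (all numbers hypothetical, chosen only for illustration) shows the fixed point reproducing the serial rollout exactly:

```python
# Toy usage of the sketch above: an order-up-to policy on a scalar
# inventory level with constant demand (all numbers hypothetical).
policy = lambda s: max(10.0 - s, 0.0)             # order up to a level of 10
transition = lambda s, a: max(s + a - 3.0, 0.0)   # demand of 3 per period

seq = simulate_sequential(policy, transition, s0=0.0, T=50)
pic = simulate_picard(policy, transition, s0=0.0, T=50)
assert seq == pic   # the Picard fixed point matches the serial rollout
```

Note that in this sequential sketch each sweep still issues O(T) policy calls, so the wall-clock win only materializes when those calls are batched on parallel hardware and the number of sweeps is small, which is what the paper establishes for SCO problems.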
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/atzheng/picard-iteration-icml
Primary Area: Reinforcement Learning->Everything Else
Keywords: Parallel Computing, Markov Decision Process, Policy Evaluation, Supply Chain, Operations Research
Submission Number: 5002