Abstract: Many real-world decision problems, ranging from asset-maintenance scheduling to portfolio rebalancing, can be naturally modelled as budget-constrained multi-component monotonic Partially Observable Markov Decision Processes (POMDPs): each component’s latent state degrades stochastically until an expensive restorative action is taken, while all assets share a fixed intervention budget.
For a large number of assets, deriving an optimal policy for this joint POMDP is computationally intractable. To tackle this challenge, we prove that the value function of the associated belief-MDP is \emph{budget-concave}, which allows an efficient two-step approach to finding a near-optimal policy. First, we approximate the optimal cross-component budget split via a random-forest surrogate of each single-component value function. Second, we solve each resulting budget-constrained single-component POMDP with an oracle-guided meta-trained Proximal Policy Optimization (PPO) policy: value iteration on the fully observable counterpart yields an oracle that shapes the PPO update and greatly accelerates learning. We validate our method through experiments in two disparate domains: (i) preventive maintenance for a large-scale building infrastructure containing 1,000 components, and (ii) portfolio risk management under debit-only loss-budget constraints, where each asset’s latent budget depletes with market losses and can only be replenished through costly recapitalization. Results show that our method consistently achieves longer component survival times and greater portfolio viability than both baseline heuristics and vanilla PPO. Furthermore, solution time scales linearly with the number of components.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=yEAnjlmliL&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: We substantially revised the manuscript in response to the reviews.
First, we corrected the concavity analysis in Sec. 4.1. The previous proof incorrectly relied on the pointwise maximum of concave functions being concave. We removed that argument and now prove budget concavity for the expected-cost relaxation,
$$
V_H^{\rm soft}(b,B)=\sup_{\pi:K_H(\pi\mid b)\le B}J_H(\pi\mid b).
$$
We then relate this relaxed value to the original hard-budget objective as an upper envelope, give conditions under which the relaxation is exact, and provide a soft-hard gap bound in terms of the hard-budget violation probability. We also add a Doob-martingale/Azuma-Hoeffding bound and an empirical validation plot showing that the observed violation probabilities are small for representative infrastructure components.
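As a brief sketch of why the expected-cost relaxation is concave in the budget, consider the standard mixing argument below. It assumes that the policy class is closed under randomization at time zero, which is our assumption for illustration; the proof in Sec. 4.1 may proceed differently. Take budgets $B_1,B_2$, policies $\pi_1,\pi_2$ feasible for them, and let $\pi_\lambda$ (a hypothetical symbol) run $\pi_1$ with probability $\lambda\in[0,1]$ and $\pi_2$ otherwise. Since both $K_H$ and $J_H$ are expectations, they are linear in the mixture:
$$
K_H(\pi_\lambda\mid b)=\lambda K_H(\pi_1\mid b)+(1-\lambda)K_H(\pi_2\mid b)\le \lambda B_1+(1-\lambda)B_2,
\qquad
J_H(\pi_\lambda\mid b)=\lambda J_H(\pi_1\mid b)+(1-\lambda)J_H(\pi_2\mid b).
$$
Taking suprema over $\pi_1$ and $\pi_2$ then gives
$$
V_H^{\rm soft}\bigl(b,\lambda B_1+(1-\lambda)B_2\bigr)\ \ge\ \lambda V_H^{\rm soft}(b,B_1)+(1-\lambda)V_H^{\rm soft}(b,B_2).
$$
The same step fails for a budget enforced on every trajectory, because the mixture can spend up to $\max(B_1,B_2)$ on a single run. For reference, the generic Azuma-Hoeffding inequality states that a martingale $(M_t)$ with bounded increments $|M_t-M_{t-1}|\le c_t$ satisfies
$$
\Pr\bigl(M_H-M_0\ge\varepsilon\bigr)\le\exp\!\left(-\frac{\varepsilon^2}{2\sum_{t=1}^{H}c_t^2}\right),
$$
where $M_t$, $c_t$, and $\varepsilon$ are illustrative symbols; how the paper instantiates them for the cumulative cost is not reproduced here.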
Second, we expanded the motivation for the transition model in Sec. 3. We now explain that the transition kernel abstracts a standard condition-based maintenance structure over ordered condition states: passive operation and inspection follow stochastic deterioration dynamics, while maintenance/repair/replacement changes the transition law to a repair-effect kernel. We added recent references from partially observable maintenance, inspection planning, and infrastructure asset management to support this modeling choice.
Third, we clarified the role of monotonicity and the scope of the structural assumptions. The revised Introduction, Related Work, and Problem Formulation now explain that “monotonic POMDP” refers to the deterioration-restoration structure used in the paper, rather than a new general POMDP class. We also added a Limitations discussion clarifying that direct cross-component coupling or strongly non-monotone dynamics would require a different model.
Fourth, we added new experiments and ablations in Sec. 5.1.2 and the appendix. We now include a surrogate-family ablation comparing exponential, logarithmic, power-law, Hill/Michaelis-Menten, tanh, and piecewise-linear concave surrogates in both in-distribution and extrapolation settings. These results show that the method is not brittle to the exponential form; rather, several monotone concave saturating surrogates perform similarly, while the exponential form remains a simple and stable representative used in our random-forest parameter-prediction pipeline.
Fifth, we added a static-versus-periodic budget reallocation experiment. The new results show that periodic reallocation can provide modest mean survival-time gains, but requires repeated residual surrogate refits and therefore incurs substantially higher runtime. This supports the use of one-shot static allocation as the main scalable method.
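To illustrate what a one-shot static allocation over concave surrogates can look like, here is a minimal Python sketch. It is not the paper's implementation: the parametrization `a * (1 - exp(-k * B))`, the function names, and the discretization of the budget into equal units are all assumptions made for the example. For separable concave surrogates, repeatedly giving the next budget unit to the component with the largest marginal gain yields an optimal split on the discretized grid.

```python
import heapq
import math


def surrogate_value(budget, a, k):
    """Hypothetical exponential surrogate: a * (1 - exp(-k * budget)).

    Monotone, concave, and saturating in the budget for a > 0, k > 0.
    """
    return a * (1.0 - math.exp(-k * budget))


def allocate_budget(params, total_units, unit=1.0):
    """Greedy one-shot split of `total_units` budget increments.

    params: list of (a, k) pairs, one per component (assumed already fitted,
            e.g. predicted by some regression model).
    Returns a list of per-component budgets summing to total_units * unit.
    """
    if not params:
        return []
    counts = [0] * len(params)

    def gain(i):
        # Marginal surrogate gain from giving component i one more unit.
        a, k = params[i]
        b = counts[i] * unit
        return surrogate_value(b + unit, a, k) - surrogate_value(b, a, k)

    # heapq is a min-heap, so store negated gains to pop the largest first.
    heap = [(-gain(i), i) for i in range(len(params))]
    heapq.heapify(heap)
    for _ in range(total_units):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-gain(i), i))
    return [c * unit for c in counts]


if __name__ == "__main__":
    # Three illustrative components with made-up surrogate parameters.
    example_params = [(10.0, 0.5), (4.0, 1.0), (7.0, 0.2)]
    print(allocate_budget(example_params, total_units=6))
```

Each of the `total_units` steps costs one heap pop and push, so the runtime of this sketch is roughly proportional to `total_units * log(N)` for `N` components. A periodic variant would rerun the same routine on the remaining budget after refitting the surrogates.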
Sixth, we added a direct comparison with the welfare-maximization method of Vora et al. (2023). We refer to this as the POMCP-welfare baseline. At $N=5$ and $N=10$, it achieves comparable total survival time to our method, but requires substantially larger wall-clock time due to repeated POMCP computations. We include aggregate results in the main text and detailed per-component plots in the appendix.
Finally, we clarified baseline hyperparameter selection in both experimental domains. For the infrastructure setting, we now describe the grid search used to select the inspection interval and repair threshold. For the financial setting, we describe the chronological validation split and grid search used to select the inspection interval and recapitalization threshold. We also made several smaller edits throughout the paper to improve flow, notation, references, captions, and appendix organization.
Assigned Action Editor: ~Tongzheng_Ren1
Submission Number: 6126