The Number of Trials Matters in Infinite-Horizon General-Utility Markov Decision Processes

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
Abstract: The general-utility Markov decision processes (GUMDPs) framework generalizes the MDPs framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy. In this work, we contribute the first analysis of the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs. We show that, as opposed to standard MDPs, the number of trials plays a key role in infinite-horizon GUMDPs, and the expected performance of a given policy depends, in general, on the number of trials. We consider both discounted and average GUMDPs, where the objective function depends, respectively, on discounted and average frequencies of visitation of state-action pairs. First, we study policy evaluation under discounted GUMDPs, proving lower and upper bounds on the mismatch between the finite- and infinite-trials formulations. Second, we address average GUMDPs, studying how different classes of GUMDPs affect the mismatch between the finite- and infinite-trials formulations. Third, we provide a set of empirical results supporting our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.
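As a concrete illustration of the mismatch the abstract describes, the following minimal Python sketch (not code from the linked repository; the toy transition kernel P, policy pi, entropy utility f, and horizon truncation are all illustrative assumptions) compares the utility of the exact discounted occupancy, the infinite-trials value f(d), against a Monte Carlo estimate of the finite-trials value, i.e., the expected utility of the empirical occupancy averaged over N sampled trajectories.

```python
import numpy as np

# A minimal sketch, assuming a toy MDP, of the finite- vs infinite-trials
# mismatch in a discounted GUMDP. All objects below are illustrative.
rng = np.random.default_rng(0)
gamma = 0.9
n_states, n_actions = 3, 2

# P[s, a] is a distribution over next states; pi[s] a distribution over actions.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
pi = rng.dirichlet(np.ones(n_actions), size=n_states)
mu0 = np.full(n_states, 1.0 / n_states)  # uniform initial-state distribution

# Exact discounted state occupancy (the "infinite trials" object):
# d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s), solved in closed form.
P_pi = np.einsum("sa,san->sn", pi, P)  # state-to-state kernel under pi
d_state = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu0)
d = d_state[:, None] * pi              # occupancy over state-action pairs

def f(d):
    """A nonlinear (concave) utility: entropy of the occupancy measure."""
    return -np.sum(d * np.log(d + 1e-12))

def sample_occupancy(horizon=100):
    """Empirical discounted occupancy of one truncated sampled trajectory."""
    d_hat = np.zeros((n_states, n_actions))
    s = rng.choice(n_states, p=mu0)
    for t in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        d_hat[s, a] += (1 - gamma) * gamma**t
        s = rng.choice(n_states, p=P[s, a])
    return d_hat

# Finite-trials value: f applied to the average of N sampled occupancies,
# estimated by averaging over independent repetitions.
for N in (1, 10, 50):
    vals = [f(np.mean([sample_occupancy() for _ in range(N)], axis=0))
            for _ in range(100)]
    print(f"N={N:>2}: E[f(d_hat_N)] ~ {np.mean(vals):.4f}   f(d) = {f(d):.4f}")
```

For a linear f the two quantities coincide for any N, as in standard MDPs; for the concave entropy above, Jensen's inequality gives E[f(d_hat_N)] <= f(d), and the gap shrinks as N grows, which is the finite- vs infinite-trials mismatch the paper quantifies.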
Lay Summary: Many modern AI systems, like those used in robotics or game-playing, rely on learning by trial and error. These systems are typically evaluated based on how often they make good decisions over time. However, in real-world situations, the number of times an AI agent can interact with its environment is limited — and current methods don’t always account for this limitation when designing or evaluating an AI’s behavior. In our research, we explore how the number of attempts (or “trials”) an AI agent has can significantly affect its performance, especially when success depends on behavior patterns over long periods. We provide insights explaining how an agent's expected performance may change depending on the number of trials and show, through experiments, how different types of problems and environments influence the results. Our work brings us a step closer to designing AI systems that perform reliably even under limited agent-environment interaction — a common challenge in the real world, from healthcare to autonomous vehicles.
Link To Code: https://github.com/PPSantos/gumdps-number-of-trials
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Planning, sequential decision-making, general-utility Markov decision processes, convex Markov decision processes
Submission Number: 4103