Learning the Optimal Policy for Balancing Short-Term and Long-Term Rewards

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, NeurIPS 2024 poster, CC BY 4.0
Keywords: Policy Learning, Short-Term and Long-Term Rewards, Causal Inference, Decision Making
TL;DR: Learn the optimal policy that balances multiple short-term and long-term rewards, especially in scenarios where the long-term outcomes are often missing due to data collection challenges over extended periods.
Abstract: Learning the optimal policy to balance multiple short-term and long-term rewards has extensive applications across various domains, yet research on policy learning strategies in this setting remains scarce. In this paper, we aim to learn the optimal policy that effectively balances multiple short-term and long-term rewards, especially in scenarios where the long-term outcomes are often missing due to data collection challenges over extended periods. Toward this goal, we observe that the conventional linear weighting method, which aggregates multiple rewards into a single surrogate reward through weighted summation, can achieve only sub-optimal policies when the rewards are interrelated. Motivated by this, we propose a novel decomposition-based policy learning (DPPL) method that converts the whole problem into subproblems and is capable of obtaining optimal policies even when the rewards are interrelated. Nevertheless, the DPPL method requires a set of preference vectors specified in advance, which poses challenges in practical applications where selecting suitable preferences is non-trivial. To mitigate this, we further theoretically transform the optimization problem in DPPL into an $\varepsilon$-constraint problem, where $\varepsilon$ represents the minimum acceptable levels of the other rewards while one reward is maximized. This transformation provides intuition into the selection of preference vectors. Extensive experiments validate the effectiveness of the proposed method.
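As an illustrative sketch of the two formulations contrasted above (the notation $V_k(\pi)$ for the value of policy $\pi$ under the $k$-th of $K$ rewards, the weights $w_k$, and the thresholds $\varepsilon_k$ are our own shorthand, not taken from the paper): linear weighting collapses all rewards into one scalarized objective,
$$\max_{\pi} \; \sum_{k=1}^{K} w_k \, V_k(\pi), \qquad w_k \ge 0, \; \sum_{k=1}^{K} w_k = 1,$$
whereas the $\varepsilon$-constraint reformulation maximizes a single reward subject to minimum acceptable levels on the others,
$$\max_{\pi} \; V_1(\pi) \quad \text{s.t.} \quad V_k(\pi) \ge \varepsilon_k, \quad k = 2, \dots, K.$$
The latter makes the trade-off explicit: each $\varepsilon_k$ is directly interpretable as the least reward one is willing to accept on dimension $k$, which is why this view offers more intuitive guidance than choosing an abstract preference vector.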
Supplementary Material: zip
Primary Area: Causal inference
Submission Number: 17045