- Keywords: Reinforcement learning, Mean-variance trade-off
- Abstract: Risk management is critical in decision making, and the mean-variance (MV) trade-off is one of the most common criteria. However, in reinforcement learning (RL) for sequential decision making under uncertainty, most existing methods for MV control suffer from computational difficulties caused by estimating the gradient of the variance term. In this paper, in contrast to strict MV control, we consider learning MV efficient policies that achieve Pareto efficiency with respect to the MV trade-off. To this end, we train an agent to maximize the expected quadratic utility function, a common objective of risk management in finance and economics. We call our approach RL based on expected quadratic utility maximization (EQUMRL). EQUMRL avoids these computational difficulties because its objective does not involve gradient estimation of the variance. We confirm that the maximizer of the EQUMRL objective directly corresponds to an MV efficient policy under a certain condition. We conduct experiments in benchmark settings to demonstrate the effectiveness of EQUMRL.
- One-sentence Summary: Learning a policy that achieves Pareto efficiency with respect to the mean-variance trade-off by maximizing the expected quadratic utility function.
- Supplementary Material: zip
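The abstract's key point is that maximizing an expected quadratic utility requires only sample averages of the return and its square, whereas a variance term `Var(R) = E[R^2] - E[R]^2` couples two expectations and complicates gradient estimation. A minimal Monte Carlo sketch of such an objective is below; the utility form `U(r) = r - (lam/2) * r^2` and the risk-aversion parameter `lam` are common conventions assumed here, not necessarily the paper's exact parameterization.

```python
import numpy as np

def expected_quadratic_utility(returns, lam=1.0):
    """Monte Carlo estimate of E[U(R)] for the quadratic utility
    U(r) = r - (lam / 2) * r**2 (an assumed common form).

    This is a plain average over samples, so it can serve directly
    as a policy-gradient objective without a separate gradient of
    the variance term.
    """
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - 0.5 * lam * (returns ** 2).mean()

# Toy comparison of two hypothetical return distributions.
rng = np.random.default_rng(0)
low_risk = rng.normal(0.05, 0.05, size=10_000)   # modest mean, low variance
high_risk = rng.normal(0.06, 0.50, size=10_000)  # higher mean, high variance
print(expected_quadratic_utility(low_risk))
print(expected_quadratic_utility(high_risk))
```

With a moderate `lam`, the low-variance distribution attains the higher expected utility even though its mean is slightly smaller, illustrating how the quadratic utility trades mean against variance.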