%\documentclass{uai2023} % for initial submission
 \documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
\bibliographystyle{plainnat}
\renewcommand{\bibsection}{\subsubsection*{References}}
%\usepackage{mathtools} % amsmath with fixes and additions
%% \usepackage{siunitx} % for proper typesetting of numbers and units
%\usepackage{booktabs} % commands to create good-looking tables
%\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{amsmath,amssymb,amsfonts,amsthm}
\usepackage{algorithm,algorithmic}
\usepackage{mathtools, bbm}
\usepackage[mathscr]{euscript}
\usepackage{dsfont}
\usepackage{booktabs,multirow}
\usepackage{nicefrac}
\usepackage{subfig}
\newtheorem{definition}{Definition}
\newtheorem{proposition}{Proposition}
\newtheorem{lemma}{Lemma}
\newtheorem{assumption}{}
\renewcommand\theassumption{(A\arabic{assumption})}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{remark}{Remark}
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\p}{\mathbb{P}}
\newcommand{\1}{\mathds{1}}
\allowdisplaybreaks
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
%\newcommand{\swap}[3][-]{#3#1#2} % just an example



\title{A policy gradient approach for optimization of smooth risk measures}


%
% Add authors
%\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Nithia Vijayan}
\author[1]{Prashanth L.A.}
% Add affiliations after the authors
\affil[1]{%
    Department of Computer Science and Engineering,
    Indian Institute of Technology Madras, India.
}

\begin{document}
\maketitle
\begin{abstract}
We propose policy gradient algorithms for solving a risk-sensitive reinforcement learning (RL) problem in on-policy as well as off-policy settings. We consider episodic Markov decision processes, and model the risk using the broad class of smooth risk measures of the cumulative discounted reward. We propose two template policy gradient algorithms that optimize a smooth risk measure in on-policy and off-policy RL settings, respectively. We derive non-asymptotic bounds that  quantify the rate of convergence of our proposed algorithms to a stationary point of the smooth risk measure. As special cases, we establish that our algorithms apply to optimization of mean-variance and distortion risk measures, respectively.
\end{abstract}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
\label{sec:intro}
Risk-sensitive reinforcement learning (RL) has received considerable attention in the recent literature; representative works include \cite{tamar2012policy,prashanth2014cvar,tamar2015coherent,borkar2010risk,chow2017risk,prashanth2016mlj,borkar2010learning,prashla16,huang2017risk}. The mean-variance tradeoff \cite{markowitz1952portfolio}, value at risk (VaR), conditional value at risk (CVaR) \cite{rockafellar2000optimization}, spectral risk measures \cite{acerbi2002spectral}, distortion risk measures \cite{denneberg1990}, and a risk measure based on cumulative prospect theory (CPT) \cite{tversky1992advances} are some of the popular risk measures considered in the literature.

Policy gradients form a popular solution approach for traditional risk-neutral RL. The idea here is to consider a parameterized set of policies, usually in a continuous space, and perform a random search using stochastic gradient ascent to find a `good-enough' policy that optimizes a certain performance criterion. Several risk-sensitive RL algorithms employ this approach to find policies that are risk-optimal; see \cite{prashla2021} for a detailed survey of some of the recent developments in this research direction.

In this paper, we consider the problem of optimizing an abstract smooth risk measure (SRM) in a risk-sensitive RL context. SRMs constitute a broad class of risk measures that includes the mean-variance risk measure (MVRM) and the distortion risk measure (DRM). The mean-variance tradeoff is a well-known risk measure that is closely related to the exponential cost risk measure, a connection that can be seen using a Taylor series expansion (cf. \cite{prashanth2016mlj}). A DRM is an expectation w.r.t. a distorted distribution, which is obtained by applying a distortion function to the underlying cumulative distribution function (CDF). Popular risk measures like VaR and CVaR can be seen as special cases of DRM using appropriate distortion functions. However, VaR is not a popular objective for risk-sensitive optimization since it is not coherent\footnote{A risk measure is said to be coherent if it is translation invariant, sub-additive, positive homogeneous, and monotonic \cite{artzner99}.}, while CVaR, though coherent, is not preferable, as it weighs all rewards below VaR equally and ignores all those beyond VaR. A DRM is preferable as it can weigh all rewards appropriately, rather than assigning equal weight to, or focusing selectively on, a fraction of them, as a tail-based risk measure like CVaR does.

We employ the policy gradient approach for solving a risk-sensitive Markov decision process (MDP), with an SRM as the objective. The goal in our formulation is to find a policy that maximizes the SRM of the cumulative reward in an episodic MDP. We propose a template policy gradient algorithm to solve this problem for an abstract SRM. The template algorithm has two crucial components: a risk estimation scheme and a gradient estimation scheme. The risk estimation scheme for an abstract SRM is assumed to guarantee an $O(\nicefrac{1}{m})$ mean-square error (MSE), where $m$ is the number of episodes. This MSE requirement is natural, since such a bound is what one would expect even for an expected value objective in a risk-neutral setting. For MVRM and DRMs, we establish such a bound for natural estimators. Unlike the expected value, for which a sample mean is a good estimator, estimating a DRM is more challenging: the episodes are generated according to the CDF of the cumulative reward, while a DRM is an expectation w.r.t. a distorted distribution. Hence, a sample mean does not suffice for DRM estimation, and an estimate of the underlying CDF is necessary.

For the purpose of gradient estimation, we employ the smoothed functional (SF) approach. This scheme falls under the realm of simultaneous perturbation methods \cite{shalabh_book}, which estimate the gradient of a function given noisy observations. Simultaneous perturbation methods in general, and SF methods in particular, are efficient and easy to implement as they require only two function measurements for estimating the gradient, irrespective of the parameter dimension. The choice of the SF scheme for estimating the gradient of an abstract SRM is not arbitrary. For some risk measures, it is not possible to employ the likelihood ratio method to arrive at a policy gradient theorem. This is true for the mean-variance risk measure, as shown in \cite{prashanth2016mlj}. In contrast, for the classic expected value objective, one can use the policy gradient theorem to arrive at a gradient estimation scheme based on the likelihood ratio method.

We now summarize our contributions. First, we propose two template policy gradient algorithms with an SRM as the objective. The first algorithm operates in an on-policy RL setting, while the second caters to the off-policy RL setting. Second, we derive non-asymptotic bounds that quantify the rate of convergence of our proposed algorithms to a stationary point of an SRM. As special cases, we establish that our algorithms and associated theoretical guarantees apply to optimization of mean-variance and distortion risk measures, respectively, in a risk-sensitive RL context. To the best of our knowledge, policy gradient algorithms with non-asymptotic convergence guarantees are not available in the literature for SRMs in general, and for the special cases of the mean-variance risk measure and DRMs in particular. Our non-asymptotic bound for the template algorithm can be used as a blackbox to characterize the convergence rate for SRMs beyond mean-variance and DRM. In particular, one can arrive at an $O(1/\epsilon^{2})$ bound on the number of iterations for convergence to an $\epsilon$-stationary point of the SRM, provided one verifies the necessary assumptions that guarantee smoothness of the SRM and an MSE bound on the SRM estimators.

\textbf{Related work.}
In \cite{tamar2015}, the authors propose a policy gradient algorithm for an abstract coherent risk measure, and derive a policy gradient theorem using the dual representation of a coherent risk measure. Their estimation scheme requires solving a convex optimization problem. Also, they establish asymptotic consistency of their proposed gradient estimate. In contrast, our estimation scheme is computationally inexpensive, and our theoretical guarantees are non-asymptotic in nature.
In \cite{prashla2021}, the authors survey policy gradient algorithms for optimizing different risk measures in a constrained as well as an unconstrained RL setting.
They provide a non-asymptotic bound of $O(1/N^{1/3})$ for an abstract smooth risk measure, assuming a gradient oracle that satisfies certain bias-variance conditions. In contrast, we provide concrete gradient estimation schemes in a risk-sensitive RL setting, and more importantly, we derive an improved non-asymptotic bound of order $O(1/\sqrt{N})$.
In \cite{prashla16}, the authors consider a CPT-based objective in an RL setting, employ the simultaneous perturbation stochastic approximation (SPSA) method for gradient estimation, and provide asymptotic convergence guarantees for their algorithm. The optimization of a DRM is closely related to that of CPT. Under general conditions on the policy parameterization, which are usually employed in the analysis of policy gradient algorithms, we show that DRM is smooth, in turn leading to a non-asymptotic bound of $O(1/\sqrt{N})$. This is unlike \cite{prashla16}, where the authors provide asymptotic guarantees assuming the policy parameterization ensures that the CPT-value is three times continuously differentiable, a condition that is hard to verify in practice.
In a non-RL context, the authors in \cite{glynn21} study the sensitivity of DRM using an estimator that is based on the generalized likelihood ratio method, and establish a central limit theorem for their gradient estimator. In \cite{holland22}, the authors analyze the optimization of spectral risk measures in an empirical risk minimization framework that assumes convex losses.

The rest of the paper is organized as follows: Section \ref{sec:prelims} provides the preliminaries for a risk-sensitive episodic problem. Section \ref{sec:sf} introduces our proposed policy gradient template for smooth risk measures. Section \ref{sec:main} presents the non-asymptotic bounds for our proposed algorithms. Section \ref{sec:app} outlines the application of our algorithms to two prominent examples of SRM, namely, DRM and MVRM. Finally, Section \ref{sec:conclusions} provides the concluding remarks.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Preliminaries}
\label{sec:prelims}
We consider an MDP with a state space $\mathscr{S}$ and an action space $\mathscr{A}$. We assume that $\mathscr{S}$ and $\mathscr{A}$ are finite spaces. Let $r:\mathscr{S}\times\mathscr{A}\times\mathscr{S}\to [-r_{\textrm{max}},r_{\textrm{max}}], r_{\textrm{max}}\in\mathbb{R}^{+}$ be the single stage scalar reward, and $p:\mathscr{S}\times\mathscr{S}\times\mathscr{A} \to [0,1]$ be the transition probability function. We consider episodic problems, where each episode starts at a fixed state $S_0$ and terminates at a special zero reward absorbing state $0$. The action selection is based on parameterized stochastic policies $\{\pi_\theta:\mathscr{S}\times\mathscr{A}\times\mathbb{R}^d\to[0,1],\theta\in\mathbb{R}^d\}$. We make the following assumptions on the parameterized policies $\{\pi_\theta,\theta\in\mathbb{R}^d\}$:
\begin{assumption}
    \label{as:proper}
    The policies $\{\pi_\theta,\theta\in\mathbb{R}^d\}$ are proper, i.e.,\\
    $\exists M>0: \max\limits_{s\in\mathscr{S}}\mathbb{P}\left(S_M \neq 0 | S_0=s,\pi_\theta \right)<1,\forall \theta \in \R^d$.
\end{assumption}
\begin{assumption}
    \label{as:nabla_logpi}
    $\exists M_{d},M_{h}>0: \forall \theta\in \R^d, \forall a\in \mathscr{A}, s \in \mathscr{S}$, $\left\lVert\nabla\log \pi_{\theta}(a\mid s)\right\rVert\leq M_d$, and $\left\lVert\nabla^2\log \pi_{\theta}(a\mid s)\right\rVert\leq M_h$,
    where $\lVert\cdot\rVert$ is the $d$-dimensional Euclidean norm when the operand is a vector, and the operator norm when the operand is a matrix.
\end{assumption}

Assumption \ref{as:proper} is commonly used in the analysis of episodic MDPs (cf. \cite{ndp_book}).
An assumption like \ref{as:nabla_logpi} is common for analyzing policy gradient algorithms (cf. \cite{zhangK2020,papini2018}). To illustrate the plausibility of \ref{as:nabla_logpi}, let us examine a policy that follows a Gibbs distribution, i.e., $\pi_\theta(a|s)=\nicefrac{\exp(h(s,a,\theta))}{\sum_{b\in\mathscr{A}}\exp(h(s,b,\theta))}$, where $h:\mathscr{S}\times \mathscr{A}\times\R^d\to\R$ is a user-defined function.
We can see that,
\begin{align*}
    &\nabla \log \pi_\theta(a|s)=\nabla h(s,a,\theta) - \sum_{b\in\mathscr{A}}\pi_\theta(b|s)\nabla h(s,b,\theta);\\
    &\nabla^2 \log \pi_\theta(a|s)=\nabla^2 h(s,a,\theta)\\
    & + \left(\sum_{b\in\mathscr{A}}\pi_\theta(b|s) \nabla h(s,b,\theta)\right)\!\left(\sum_{b\in\mathscr{A}}\pi_\theta(b|s) \nabla h(s,b,\theta)\right)^\top\\
    &-\sum_{b\in\mathscr{A}}\pi_\theta(b|s)\left(\nabla^2 h(s,b,\theta)+\nabla h(s,b,\theta)\nabla h(s,b,\theta)^\top\right).
\end{align*}
If we choose a linear policy class, i.e., $h(s,a,\theta)=\phi(s,a)^\top\theta$, with bounded features, i.e., $\left\lVert\phi(s,a)\right\rVert \leq M$, then
$\left\lVert \nabla \log \pi_\theta(a|s)\right\rVert\leq\lvert\mathscr{A}\rvert M $, and
$\left\lVert\nabla^2 \log \pi_\theta(a|s)\right\rVert \leq   \lvert \mathscr{A}\rvert M^2+\lvert \mathscr{A}\rvert^2  M^2$. Since we consider finite state-action spaces, it is easy to arrive at constants $M_d$ and $M_h$ that ensure \ref{as:nabla_logpi} holds.
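As a quick check of these claims, the following sketch bounds both quantities directly for the linear class. It uses only the triangle inequality together with $\nabla h(s,a,\theta)=\phi(s,a)$ and $\nabla^2 h(s,a,\theta)=0$, and the resulting constants are one possible choice rather than the only one:
\begin{align*}
    &\left\lVert \nabla \log \pi_\theta(a|s)\right\rVert
    \leq \left\lVert\phi(s,a)\right\rVert + \sum_{b\in\mathscr{A}}\pi_\theta(b|s)\left\lVert\phi(s,b)\right\rVert \leq 2M,\\
    &\left\lVert \nabla^2 \log \pi_\theta(a|s)\right\rVert
    \leq \Bigl\lVert\sum_{b\in\mathscr{A}}\pi_\theta(b|s)\phi(s,b)\Bigr\rVert^2\\
    &\qquad + \sum_{b\in\mathscr{A}}\pi_\theta(b|s)\left\lVert\phi(s,b)\right\rVert^2 \leq 2M^2.
\end{align*}
These are no larger than the constants stated above whenever $\lvert\mathscr{A}\rvert\geq 2$.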

We denote by $S_t$ and $A_t$ the state and the action at time $t\in\{0,1,\ldots\}$, respectively. The cumulative discounted reward $R^\theta$, which is a random variable, is defined as follows:
\begin{align}
    R^\theta=\sum\limits_{t=0}^{T-1}\gamma^t r(S_t,A_t,S_{t+1}),  \forall \theta \in \R^d,
    \label{eq:Rtheta}
\end{align}
where $A_t \sim \pi_\theta(\cdot\mid S_t)$, $S_{t+1}\sim p(\cdot,S_t,A_t)$, $\gamma \in (0,1)$ is the discount factor, and $T$ is the random length of an episode. We can see that $\forall \theta \in \mathbb{R}^d,\lvert R^\theta \rvert < \frac{r_{\textrm{max}}}{1-\gamma}=M_r$ a.s.
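The bound on $R^\theta$ follows from a geometric series. As a short sketch, for any finite episode length $T$ and $r_{\textrm{max}}>0$,
\begin{align*}
    \lvert R^\theta\rvert \leq \sum_{t=0}^{T-1}\gamma^t r_{\textrm{max}}
    = r_{\textrm{max}}\,\frac{1-\gamma^T}{1-\gamma} < \frac{r_{\textrm{max}}}{1-\gamma}.
\end{align*}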
From \ref{as:proper}, we infer that $\E[T]<\infty$. This fact in conjunction with $T\geq0$ implies the following bound:
\begin{align}
    \label{eq:M_pi}
    \exists M_e >0 : T \leq M_e < \infty \textrm{ a.s}.
\end{align}

On-policy learning is a scheme where a policy parameter $\theta$ is optimized using the data collected by the same policy $\pi_\theta$. In contrast, off-policy learning is a scheme where we optimize $\theta$ using data collected by a different behavior policy $b$.

In an off-policy setting, we collect episodes from $b$ and estimate the values of $\pi_\theta$, using importance sampling ratios. We require the behavior policy $b$ to be proper, i.e.,
\begin{assumption}
    \label{as:b_proper}
    $\exists M > 0:\; \max_{s\in\mathscr{S}}\mathbb{P}\left(S_M \neq 0 \mid S_0=s, b \right)<1$.
\end{assumption}
We also assume that the target policy $\pi_\theta$ is absolutely continuous w.r.t. the behavior policy $b$, i.e.,
\begin{assumption}
    \label{as:b_pol}
    $\forall \theta \in \!\R^d, b(a | s) \!=\!0 \Rightarrow \pi_\theta(a | s)\!=\!0,\forall a \in \mathscr{A}, \forall s \in \mathscr{S}$.
\end{assumption}
Assumption \ref{as:b_pol} is standard in an off-policy RL setting (cf. \cite{sutton08}).

The cumulative discounted reward $R^b$, which is a random variable, is defined as follows:
\begin{align}
    \label{eq:Rb}
    R^b=\sum\limits_{t=0}^{T-1}\gamma^t r(S_t,A_t,S_{t+1}),
\end{align}
where $A_t \sim b(\cdot\mid S_t)$, $S_{t+1}\sim p(\cdot,S_t,A_t)$, $\gamma \in (0,1)$, and $T$ is the random length of an episode. As before, \ref{as:b_proper} implies $\E[T]<\infty$, and the following bound:
\begin{align}
    \label{eq:M_b}
    \exists M_e >0 : T \leq M_e < \infty, \textrm{ a.s}.
\end{align}
The importance sampling ratio $\psi^\theta$ is defined by
\begin{align}
    \label{eq:psi}
    \psi^\theta = \prod\limits_{t=0}^{T-1}\frac{\pi_{\theta}(A_t\mid S_t)}{b(A_t\mid S_t)}.
\end{align}
From \ref{as:nabla_logpi} and \ref{as:b_pol}, we obtain $\forall \theta \in \R^d,\pi_{\theta}(a|s)>0$ and $b(a|s) >0$, $\forall a \in \mathscr{A}, \textrm{ and } \forall s \in \mathscr{S}$. This fact in conjunction with \eqref{eq:M_b} implies the following bound for $\psi^\theta$:
\begin{align}
    \label{eq:is_ratio}
    \exists M_s>0 : \forall \theta\in\R^d, \psi^\theta \leq M_s, \textrm{ a.s}.
\end{align}
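One way to obtain such a constant, sketched here with the shorthand $b_{\min}=\min_{s\in\mathscr{S},a\in\mathscr{A}} b(a|s)$ (notation introduced only for this illustration), uses the finiteness of $\mathscr{S}$ and $\mathscr{A}$, which gives $b_{\min}\in(0,1]$. Since $\pi_\theta(a|s)\leq 1$ and $T\leq M_e$ a.s.,
\begin{align*}
    \psi^\theta \leq \prod_{t=0}^{T-1}\frac{1}{b(A_t\mid S_t)} \leq b_{\min}^{-T} \leq b_{\min}^{-M_e},
\end{align*}
so $M_s=b_{\min}^{-M_e}$ is an admissible choice.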

The cumulative discounted reward is a random variable as there is randomness in state transition as modeled by the transition probability function as well as in the action selection in the case of stochastic policies. We consider a smooth risk measure $\rho$ as an objective function, which provides a numerical value that represents certain aspects of this random variable.
\begin{definition}
    A risk measure is smooth if it satisfies the following condition: There exists a positive constant $L_{\rho'}$ such that,
    \begin{align}
       \forall \theta_1, \theta_2 \in \R^d, \left\lVert \nabla\rho(\theta_1)-\nabla\rho(\theta_2)\right\rVert \leq L_{\rho'} \left\lVert \theta_1 \!-\! \theta_2 \right\rVert.
    \end{align}
\end{definition}
Under relatively general conditions, DRM and MVRM can be considered as instances of smooth risk measures. We establish this fact in Section \ref{sec:app}.

Our goal is to find a policy parameter $\theta^*$ that maximizes the objective function $\rho$, i.e.,
\begin{align}
    \label{eq:max_theta}
    \theta^*\in\textrm{arg}\!\max_{\theta \in \mathbb{R}^d}  \rho(\theta).
\end{align}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Policy gradient template}
\label{sec:sf}
We propose two policy gradient algorithms for optimizing a smooth risk measure. The first algorithm operates in an on-policy RL setting, and Algorithm \ref{alg:onP} presents the pseudocode. The second algorithm caters to an off-policy RL setting, with a pseudocode that follows the template in Algorithm \ref{alg:onP} with variations in estimation.
There are two crucial ingredients in each of these policy gradient algorithms:
\begin{enumerate}
    \item Risk estimation: This refers to the problem of estimating the value of a smooth risk measure for a given policy parameter, say $\theta$. In an on-policy setting, the estimation scheme has access to a mini-batch of episodes  from the policy $\pi_\theta$ itself. On the other hand, in an off-policy setting, the estimation scheme has to use the episodes simulated using a behavior policy.
    \item Gradient estimation: This refers to the estimation of the policy gradient $\nabla \rho(\theta)$ for a given parameter $\theta$. Such an estimate would be used to perform stochastic gradient ascent in the policy parameter.
\end{enumerate}
The estimation scheme is specific to the risk measure considered. For the theoretical guarantees in the next section, we require the following bound on the estimate $\hat\rho_m(\theta)$ of the risk $\rho(\theta)$, given $m$ episodes: For some positive constant $C_1$,
\begin{align}
    \label{eq:msebound}
    \E\left[\left\lvert \hat{\rho}_m(\theta) - \rho(\theta) \right\rvert^2\right]\leq\frac{C_1}{m}.
\end{align}
The condition above relates to the mean-square error of the risk estimator, and the rate of $O(\nicefrac{1}{m})$ is natural, considering such a bound is reasonable even for the case of an expected value objective.
For the two applications with mean-variance and distortion risk measures, we shall establish later that the estimators of these risk measures satisfy the condition specified above.

For handling the problem of gradient estimation, both algorithms use an SF-based estimation scheme. The choice of this gradient estimation scheme is not arbitrary. The application of the likelihood ratio method to derive a policy gradient theorem is not viable for certain risk measures. This limitation is evident in the case of the mean-variance risk measure, as demonstrated in \cite[Lemma~1]{prashanth2016mlj}. Specifically, the policy gradient expression for the squared value $\E\left[(R^\theta)^2\right]$ involves the gradient of the value function at each state of the MDP, which makes this gradient difficult to estimate accurately.
In the aforementioned reference, the authors employed SPSA, a popular simultaneous perturbation method, to work around the policy gradient expression. In our work, we use SF, which also falls under the realm of simultaneous perturbation methods for gradient estimation. Moreover, unlike \cite{prashanth2016mlj}, we consider a broad class of smooth risk measures, and more importantly, we establish non-asymptotic bounds that quantify the rate of convergence of our proposed SF-based policy gradient algorithms.

The SF-based gradient estimation is a zeroth-order gradient estimation scheme, where the gradient is estimated from perturbed function values (cf. \cite{nesterov2017,shalabh_book, shamir}). The SF method forms a smoothed version of the objective function $\rho(\cdot)$ as $\rho_{\mu}(\cdot)$ and uses the gradient $\nabla \rho_{\mu}$ as an approximation for $\nabla \rho$. The smoothed functional $\rho_{\mu}(\theta)$ is defined as
\begin{align}
    \label{eq:rho_mu}
    \rho_{\mu}(\theta) = \mathbb{E}_{u \in \mathbb{B}^d}\left[\rho({\theta+\mu u})\right],
\end{align}
where $u$ is sampled uniformly at random from the unit ball $\mathbb{B}^d=\{x\in\mathbb{R}^d \mid \lVert x \rVert \leq 1\}$, and $\mu \in (0,1]$ is the smoothing parameter.
From \cite[Lemma 2.1]{flaxman}, we obtain the following expression for the gradient of $\rho_{\mu}(\theta)$.
\begin{align}
    \label{eq:del_rho_mu}
    \nabla\rho_{\mu}(\theta)=\mathbb{E}_{v\in\mathbb{S}^{d-1}}\left[\frac{d}{\mu}\rho({\theta+\mu v})v\right],
\end{align}
where $v$ is sampled uniformly at random from the unit sphere $\mathbb{S}^{d-1}=\{x\in\mathbb{R}^d \mid \lVert x \rVert = 1\}$.
In a deterministic optimization setting with  perfect measurements of $\rho(\cdot)$, the gradient $\nabla\rho_{\mu}(\theta)$ is estimated as follows:
\begin{align}
    \label{eq:hat_nabla_rho_0}
    \widehat{\nabla}_{\mu,n}\rho(\theta) = \frac{d}{n}\sum\limits_{i=1}^{n} \frac{\rho({\theta+\mu v_i}) - \rho({\theta - \mu v_i})}{2\mu}v_i,
\end{align}
where each $v_i$ is sampled uniformly at random from $\mathbb{S}^{d-1}$. The gradient estimate is averaged over $n$ unit vectors to reduce the variance.
Using the proof technique from \cite{nv1}, we show that $\widehat{\nabla}_{\mu,n}\rho(\theta)$ is an unbiased estimator of $\nabla\rho_{\mu}(\theta)$; see Appendix~A for the details.
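The unbiasedness can be sketched as follows, assuming only that each $v_i$ is uniform on $\mathbb{S}^{d-1}$, so that $-v_i$ has the same distribution as $v_i$; the appendix may organize the argument differently. Applying \eqref{eq:del_rho_mu} to $v$ and to $-v$,
\begin{align*}
    &\E\left[\frac{d}{2\mu}\left(\rho(\theta+\mu v)-\rho(\theta-\mu v)\right)v\right]\\
    &= \frac{1}{2}\E\left[\frac{d}{\mu}\rho(\theta+\mu v)v\right] + \frac{1}{2}\E\left[\frac{d}{\mu}\rho(\theta-\mu v)(-v)\right]\\
    &= \frac{1}{2}\nabla\rho_{\mu}(\theta)+\frac{1}{2}\nabla\rho_{\mu}(\theta) = \nabla\rho_{\mu}(\theta),
\end{align*}
and averaging over $i=1,\ldots,n$ in \eqref{eq:hat_nabla_rho_0} preserves the mean.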

In a typical RL setting, we may not have direct measurements of $\rho(\cdot)$; instead, $\rho(\cdot)$ needs to be estimated using sample episodes. Letting $\hat{\rho}_m(\cdot)$ denote the estimator of $\rho(\cdot)$, we use the following gradient estimator:
\begin{align}
    \label{eq:hat_nabla_hat_rho}
    \widehat{\nabla}_{\mu,n}\hat{\rho}_m(\theta) = \frac{d}{n}\sum\limits_{i=1}^{n} \frac{\hat{\rho}_m({\theta+\mu v_i}) - \hat{\rho}_m({\theta - \mu v_i})}{2\mu}v_i.
\end{align}
We solve \eqref{eq:max_theta} using the following update iteration:
\begin{align}
    \label{eq:approx_theta_update}
    \theta_{k+1} = \theta_k + \alpha \widehat{\nabla}_{\mu,n} \hat{\rho}_m({\theta_k}),
\end{align}
where $\theta_0$ is set arbitrarily, and $\alpha$ is the step-size.

We consider two algorithms, both armed with a risk estimator $\hat{\rho}_m(\cdot)$ and a risk gradient estimate using SF. In our first algorithm OnP-SF, $\hat{\rho}_m(\cdot)$ uses an on-policy evaluation.  Algorithm \ref{alg:onP} presents the pseudocode of OnP-SF.
\begin{algorithm}[ht]
    \caption{OnP-SF}
    \label{alg:onP}
    \begin{algorithmic}[1]
        \STATE \textbf{Input}: Parameterized form of the policy $\pi$, iteration limit $N$, step-size $\alpha$, perturbation parameter $\mu$, and batch sizes $m$ and $n$;
        \STATE \textbf{Initialize}: Target policy parameter $\theta_{0} \in \mathbb{R}^d$, and the discount factor $\gamma \in (0,1)$;
        \FOR {$k=0,\hdots, N-1$ }
        \FOR {$i=1,\hdots, n$ }
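        \STATE Sample $v_i=[v_i^1, \hdots, v_i^d]$ uniformly at random from $\mathbb{S}^{d-1}$;
        \STATE Generate $m$ episodes using each of the perturbed policies $\pi_{\theta_k + \mu v_i}$ and $\pi_{\theta_k - \mu v_i}$;
        \STATE Estimate $\hat{\rho}_m(\theta_k + \mu v_i)$ and $\hat{\rho}_m(\theta_k - \mu v_i)$;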
        \ENDFOR
        \STATE Use \eqref{eq:hat_nabla_hat_rho} to estimate $\widehat{\nabla}_{\mu,n} \hat{\rho}_m({\theta_k})$;
        \STATE Use \eqref{eq:approx_theta_update} to calculate $\theta_{k+1}$;
        \ENDFOR
        \STATE \textbf{Output}: Policy $\theta_R$, where $R \sim \mathcal{U}\{0,N-1\}$
    \end{algorithmic}
\end{algorithm}

Each iteration of OnP-SF requires $2mn$ episodes corresponding to $2n$ perturbed policies. In some practical applications, it may not be feasible to generate system trajectories corresponding to different perturbed policies. Our second algorithm, OffP-SF, overcomes this problem by performing off-policy evaluation. In the off-policy setting, the number of episodes needed in each iteration of our algorithm can be reduced to $m$.
The pseudocode of OffP-SF is similar to Algorithm \ref{alg:onP} with the following deviations: The estimate $\hat{\rho}_m(\theta_k \pm \mu v_i)$ in step 7 is computed in an off-policy fashion, and for this purpose $m$ episodes are generated only once per iteration using the behavior policy. In contrast, step 6 in Algorithm \ref{alg:onP} requires the simulation of $m$ episodes for each of the perturbed policies $\pi_{\theta_k \pm \mu v_i}$.
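As a rough accounting based on the description above (a back-of-the-envelope sketch rather than a result from the analysis), the total number of episodes used over $N$ iterations is
\begin{align*}
    \textrm{OnP-SF: } 2mn\cdot N, \qquad \textrm{OffP-SF: } m\cdot N,
\end{align*}
since OnP-SF simulates $m$ episodes for each of the $2n$ perturbed policies in every iteration, whereas OffP-SF reuses a single batch of $m$ behavior-policy episodes across all $2n$ perturbations.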
% \begin{algorithm}[h]
    %     \caption{OffP-SF}
    %     \label{alg:offP_sf}
    %     \begin{algorithmic}[1]
        %         \STATE \textbf{Input}: Parameterized form of the policy $\pi_\theta$, behavior policy b, iteration limit $N$, step-size $\alpha$, smoothing parameter $\mu$, and batch sizes $m$ and $n$;
        %         \STATE \textbf{Initialize}: Policy $\theta_{0} \in \mathbb{R}^d$, and discount factor $\gamma \in (0,1)$;
        %         \FOR {$k=0,\hdots, N-1$ }
        %         \STATE Get $m$ episodes from $b$;
        %         \FOR {$i=1,\hdots, n$ }
        %         \STATE Get $[v_i^1, \hdots, v_i^d] \in \mathbb{S}^{d-1}$;
        %         \STATE Estimate $\hat{\rho}_m({\theta_k \pm \mu v_i)}$;
        %         \ENDFOR
        %         \STATE Use \eqref{eq:hat_nabla_hat_rho} to estimate $\widehat{\nabla}_{\mu,n} \hat{\rho}_m({\theta_k})$;
        %         \STATE Use \eqref{eq:approx_theta_update} to calculate $\theta_{k+1}$;
        %         \ENDFOR
        %         \STATE \textbf{Output}: Policy $\theta_R$, where $R\sim\mathcal{U}\{0,N-1\}$.
        %     \end{algorithmic}
    % \end{algorithm}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Main results}
\label{sec:main}
Our non-asymptotic analysis establishes a bound on the number of iterations of our proposed algorithms to find an $\epsilon$-stationary point of the smooth risk measure, which is defined below.
\begin{definition}[$\epsilon$-stationary point]
    \label{def:esolution}
    Fix $\epsilon>0$. Let $ \theta_R$ be the random output of an algorithm. Then, $\theta_R$ is called an $\epsilon$-stationary point of problem \eqref{eq:max_theta}, if $\mathbb{E}\left[ \left\lVert \nabla \rho \left( \theta_R \right) \right\rVert^2\right] \leq \epsilon$, where the expectation is over $R$.
\end{definition}
For a non-convex objective function, it is common in optimization literature to establish a convergence rate result to an $\epsilon$-stationary point. Such a convergence notion is used in the analysis of policy gradient algorithms as well, cf.  \cite{papini2018,shen2019hessian,zhangK2020}.

\subsection{Bounds for OnP-SF/OffP-SF}
We  make the following assumptions for the sake of analysis.
\begin{assumption}
    \label{as:func_bound}
    $\forall \theta \in \R^d$, $\hat{\rho}_m(\theta)$ and $\rho(\theta)$ are bounded.
\end{assumption}
\begin{assumption}
    \label{as:mse}
    $\forall \theta \in \R^d$, $\E\left[\left\lvert \hat{\rho}_m(\theta) - \rho(\theta) \right\rvert^2\right]\leq\frac{C_1}{m}$.
\end{assumption}
\begin{assumption}
    \label{as:lip}
    $\forall \theta_1, \theta_2 \in \R^d$, $\left\lvert \rho(\theta_1)-\rho(\theta_2)\right\rvert \leq L_{\rho} \left\lVert \theta_1 - \theta_2 \right\rVert$.
\end{assumption}
\begin{assumption}
    \label{as:smooth}
    $\forall \theta_1, \theta_2 \in \R^d$, $\left\lVert \nabla \rho (\theta_1)-\nabla \rho (\theta_2)\right\rVert \leq L_{\rho'} \left\lVert \theta_1 \!-\! \theta_2 \right\rVert$.
\end{assumption}

We present bounds for an iterate $\theta_R$ that is chosen uniformly at random from $\{\theta_0,\ldots,\theta_{N-1}\}$. The bound that we present below applies to the template algorithm for on-policy as well as off-policy RL settings. Moreover, the bound below holds for general choices of the step-size, the smoothing parameter, and the batch sizes. Subsequently, we specialize this result to arrive at an $O(\nicefrac{1}{\sqrt{N}})$ bound on $\mathbb{E}\left[ \left\lVert \nabla \rho \left( \theta_R \right) \right\rVert^2\right]$. The proofs can be found in Appendix~A or in the complete version of this paper accessible through \cite{nv2023}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{proposition}(OnP-SF/OffP-SF)
    \label{pr:non_asym_sf}
    Assume \ref{as:func_bound}-\ref{as:smooth}. Let $\{\theta_i,i=0,\cdots,N-1\}$ be the policy parameters generated by OnP-SF/OffP-SF, and let $\theta_R$ be chosen uniformly at random from this set. Let $\rho^*=\max_{\theta\in\R^d}\rho(\theta)$. Then
    \begin {align}
    \label{eq:tm_sf}
    &\mathbb{E}\left[\left\lVert \nabla \rho(\theta_R)\right\rVert^2\right]
    \leq \frac{2 \left(\rho^* - \rho(\theta_{0})\right)}{N \alpha}+ \mu^2 d^2 L_{\rho'}^2 \nonumber\\
    &+L_{\rho'} \alpha \left(\frac{2d^2 L_{\rho}^2}{n}+  \frac{d^2C_1}{\mu^2mn}\right)
    + \frac{4d^2L_\rho^2}{n}+ \frac{d^2C_1}{\mu^2mn},
    \end {align}
    where $L_\rho$, $L_{\rho'}$, and $C_1$ are as in \ref{as:func_bound}--\ref{as:smooth}.
\end{proposition}
A straightforward specialization of the bound in \eqref{eq:tm_sf} with specific choices for the step-size $\alpha$, smoothing parameter $\mu$, and batch sizes $m$ and $n$ leads to the following bounds for the OnP-SF and OffP-SF algorithms, respectively.
\begin{theorem}(OnP-SF)
    \label{tm:onP_sf}
    Set $\alpha=\frac{1}{\sqrt{N}}$, $\mu=\frac{1}{\sqrt[4]{N}}$, $n=\sqrt{N}$, and $m=\sqrt{N}$. Then, under the conditions of Proposition \ref{pr:non_asym_sf}, we have
    \begin {align*}
    &\mathbb{E}\left[\left\lVert \nabla \rho(\theta_R)\right\rVert^2\right]
    \leq \frac{2 \left(\rho^* - \rho(\theta_{0})\right)}{\sqrt{N}} + \frac{2d^2 L_{\rho}^2L_{\rho'}}{N\sqrt{N}}\\
    &\quad +  \frac{4d^2L_\rho^2+d^2C_1L_{\rho'}}{N}
    + \frac{d^2C_1+ d^2 L_{\rho'}^2}{\sqrt{N}} .
    \end {align*}
\end{theorem}
\begin{theorem}(OffP-SF)
    \label{tm:offP_sf}
    Set $\alpha=\frac{1}{\sqrt{N}}$, $\mu=\frac{1}{\sqrt[4]{N}}$, $n=N$, and $m=C_2>0$. Then, under the conditions of Proposition \ref{pr:non_asym_sf}, we have
    \begin {align*}
    &\mathbb{E}\left[\left\lVert \nabla \rho(\theta_R)\right\rVert^2\right]
    \leq \frac{2 \left(\rho^* - \rho(\theta_{0})\right)}{\sqrt{N}}
    + \frac{2d^2 L_{\rho}^2L_{\rho'}}{N\sqrt{N}}\\
    &\quad+  \frac{C_24d^2L_\rho^2+d^2C_1L_{\rho'}}{C_2N}+ \frac{d^2C_1+C_2d^2 L_{\rho'}^2}{C_2\sqrt{N}}.
    \end {align*}
\end{theorem}
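As a sanity check, the OffP-SF bound can be recovered by substituting the parameter choices of Theorem \ref{tm:offP_sf} into \eqref{eq:tm_sf} term by term. With $\alpha=N^{-1/2}$, $\mu^2=N^{-1/2}$, $n=N$, and $m=C_2$, we have $\mu^2 mn=C_2\sqrt{N}$, and hence
\begin{align*}
    \frac{2\left(\rho^*-\rho(\theta_0)\right)}{N\alpha} &= \frac{2\left(\rho^*-\rho(\theta_0)\right)}{\sqrt{N}},\\
    \mu^2d^2L_{\rho'}^2 &= \frac{d^2L_{\rho'}^2}{\sqrt{N}},\\
    L_{\rho'}\alpha\,\frac{2d^2L_\rho^2}{n} &= \frac{2d^2L_\rho^2L_{\rho'}}{N\sqrt{N}},\\
    L_{\rho'}\alpha\,\frac{d^2C_1}{\mu^2mn} &= \frac{d^2C_1L_{\rho'}}{C_2N},\\
    \frac{4d^2L_\rho^2}{n} &= \frac{4d^2L_\rho^2}{N},\\
    \frac{d^2C_1}{\mu^2mn} &= \frac{d^2C_1}{C_2\sqrt{N}}.
\end{align*}
Adding the six terms and grouping them by the power of $N$ gives the right-hand side in Theorem \ref{tm:offP_sf}.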
% In the above, $C_2$ is a constant which can be tuned in accordance with the problem in hand.
\begin{remark}
    The results above show that after $N$ iterations of \eqref{eq:approx_theta_update}, OnP-SF/OffP-SF return an iterate that satisfies $\E\!\left[\left\lVert\nabla\rho(\theta_R)\right\rVert^2\right]=O\left(\nicefrac{1}{\sqrt{N}}\right)$. Put differently, $O(\nicefrac{1}{\epsilon^2})$ iterations of OnP-SF/OffP-SF suffice to find an $\epsilon$-stationary point of the smooth risk measure objective.
\end{remark}
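The iteration count in the remark above can be made explicit with a short calculation. The constant $C$ below is shorthand introduced here for illustration, and we assume that $\epsilon$-stationarity means $\E\!\left[\left\lVert\nabla\rho(\theta_R)\right\rVert^2\right]\leq\epsilon$. Since every term in Theorems \ref{tm:onP_sf} and \ref{tm:offP_sf} decays at least as fast as $\nicefrac{1}{\sqrt{N}}$, there is a constant $C>0$, independent of $N$, such that the bound is at most $\nicefrac{C}{\sqrt{N}}$ for all $N\geq 1$. Then
\begin{align*}
    \frac{C}{\sqrt{N}}\leq \epsilon \quad\Longleftrightarrow\quad N\geq \frac{C^2}{\epsilon^2},
\end{align*}
so that $N=\left\lceil \nicefrac{C^2}{\epsilon^2}\right\rceil$ iterations suffice.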

\begin{remark}
    The bounds obtained for OnP-SF and OffP-SF are $O(\nicefrac{1}{\sqrt{N}})$, but  with different choices for the parameters $\alpha, \mu, n, m$.
    One could vary the parameters $n,m$ for OnP-SF such that their product remains $N$, and still obtain the $O(\nicefrac{1}{\sqrt{N}})$ bound. The implication is that OnP-SF requires $\Theta(N)$ episodes to achieve the aforementioned rate.
    On the other hand, one can choose a constant batch size in OffP-SF and increase the parameter $n$ to be $\Theta(N)$ to arrive at an overall convergence rate of $O(\nicefrac{1}{\sqrt{N}})$. In the off-policy setting, with a fixed dataset, one could increase the parameter $n$ for higher averaging in the gradient estimate, and such a scheme would not entail simulation of additional episodes.
\end{remark}
\begin{remark}
    Typical results in the risk-sensitive RL literature are asymptotic in nature. An exception is a result from \cite{prashla2021}, where non-asymptotic bounds of $O\left(\nicefrac{1}{N^{1/3}}\right)$ are presented. In contrast, we derive $O(\nicefrac{1}{\sqrt{N}})$ bounds for SRMs.
\end{remark}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Applications}
\label{sec:app}
Under relatively general conditions, DRM and MVRM can be considered as instances of smooth risk measures. We describe these risk measures in the following sections.
\subsection{Distortion risk measures (DRM)}
\label{subsec:drm}
The DRM of $R^\theta$, defined in \eqref{eq:Rtheta}, is the expected value of $R^\theta$ under a distortion of the CDF $F_{R^{\theta}}$, attained using a given distortion function $g(\cdot)$. We denote by $\rho_g(\theta)$ the DRM of $R^\theta$, which is defined as follows:
\begin{align}
    \label{eq:rho_g_1}
    \rho_g(\theta)\!=\!\!\int_{-M_r}^{0}\!\!\!\!(g(1\!-\!F_{R^{\theta}}(x))\!-\!1) dx + \!\!\int_{0}^{M_r}\!\!\!\! g(1\!-\!F_{R^{\theta}}(x))dx.
\end{align}
The distortion function $g:[0,1]\to[0,1]$ is non-decreasing, with $g(0)=0$ and $g(1)=1$. We can see that $\rho_g(\theta)\!=\!\E[R^\theta]$, if $g(\cdot)$ is the identity function.
A few examples of $g(\cdot)$ are available in Table \ref{tb:g} and their plots are in Figure \ref{fg:g}.
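The identity case mentioned above can be checked directly. Assuming $\left\lvert R^\theta\right\rvert\leq M_r$ almost surely, substituting $g(s)=s$ in \eqref{eq:rho_g_1} gives
\begin{align*}
    \rho_g(\theta) &= -\int_{-M_r}^{0} F_{R^{\theta}}(x)\, dx + \int_{0}^{M_r} \left(1-F_{R^{\theta}}(x)\right) dx\\
    &= \E\left[R^\theta\right],
\end{align*}
where the last step is the standard tail-integral representation of the expectation of a bounded random variable.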
\begingroup
\renewcommand{\arraystretch}{1.25}
\begin{table}[t]
    \caption{Examples of distortion functions}
    \label{tb:g}
    \begin{center}
        \begin{small}
            \begin{tabular}{ll}
                \toprule
                Dual-power function & $g(s)=1-(1-s)^r$, $r\geq 2$\\
                Quadratic function    & $g(s)=(1+r) s- r s^2$, $0\leq r\leq 1$\\
                Exponential function &$g(s)=\frac{1-\exp(-r s)}{1-\exp(-r )}$, $r>0$\\
                Square-root function  & $g(s)=\frac{\sqrt{1+r s}-1}{\sqrt{1+r }-1}$, $r>0$\\
                Logarithmic function  & $g(s)=\frac{\log(1+r s)}{\log(1+r )}$, $r>0$\\
                \bottomrule
            \end{tabular}
        \end{small}
    \end{center}
   % \vskip -0.2in
\end{table}
\endgroup
\begin{figure}[t]
    \begin{center}
        \centerline{\includegraphics[width=0.8\columnwidth]{g_plots.png}}
        \caption{Examples of distortion functions}
        \label{fg:g}
    \end{center}
   % \vskip -0.2in
\end{figure}

The integration limit in \eqref{eq:rho_g_1} is $M_r = \frac{r_{\textrm{max}}}{1-\gamma}$, or any problem-specific upper bound on $\left\lvert R^\theta \right\rvert$.
As $g(\cdot) \in [0,1]$, we can infer the following bound on $ \rho_g(\cdot) $:
\begin{align}
    \label{eq:rho_g_bound}
    \left\lvert \rho_g(\cdot) \right\rvert\leq 2M_r.
\end{align}
The inequality in \eqref{eq:rho_g_bound} partially satisfies the conditions specified by \ref{as:func_bound} for the DRM.

Recall that the optimization problem in \eqref{eq:max_theta} is solved using a stochastic gradient algorithm, and each update iteration requires estimates of $\rho_g(\cdot)$. In the following sections, we describe our algorithms that estimate the DRM in on-policy and off-policy RL settings, respectively.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{On-policy DRM estimation}
\label{subsubsec:onpolicy}
We generate $m$ episodes using the policy $\pi_\theta$, and estimate the CDF $F_{R^{\theta}}(\cdot)$ using sample averages.
We denote by $R^{\theta}_i$ the cumulative reward of the episode $i$.
We form the estimate $G^m_{R^{\theta}}(\cdot)$ of $F_{R^{\theta}}(\cdot)$ as follows:
\begin{align}
    \label{eq:G}
    G^m_{R^{\theta}}(x) = \frac{1}{m}\sum\limits_{i=1}^m \1\{R^{\theta}_i\leq x\}.
\end{align}
Now, we form an estimate $\hat{\rho}_g^G(\theta)$ of $\rho_g(\theta)$ as follows:
\begin{align}
    \label{eq:hat_rho_G}
    \hat{\rho}_g^G(\theta)\!=\!\!\!\int_{-M_r}^{0}\!\!\!\!\!(g(1\!-\!G^m_{R^{\theta}}(x))\!-\!1) dx +\!\! \int_{0}^{M_r}\!\!\!\!\! g(1\!-\!G^m_{R^{\theta}}(x))dx.
\end{align}
Comparing \eqref{eq:hat_rho_G} with \eqref{eq:rho_g_1}, it is apparent that we have used the empirical distribution function $G^m_{R^{\theta}}$ in place of the true CDF $F_{R^{\theta}}$. Similar to $\rho_g(\cdot)$, we can infer the following bound on $\hat{\rho}_g^G(\cdot)$:
\begin{align}
    \label{eq:hat_rho_G_bound}
    \left\lvert \hat{\rho}_g^G(\cdot) \right\rvert\leq 2M_r.
\end{align}
The inequality in \eqref{eq:hat_rho_G_bound} along with \eqref{eq:rho_g_bound} satisfies the conditions specified by \ref{as:func_bound} for the DRM in an on-policy RL setting.

We simplify \eqref{eq:hat_rho_G} in terms of order statistics as follows:
\begin{align}
    \label{eq:hat_rho_G1}
    \hat{\rho}_g^G(\theta)=  \sum\limits_{i=1}^{m} {R^\theta_{(i)}} \left(g\left(1\!-\! \frac{i\!-\!1}{m}\right) \!-\! g\left(1\!-\! \frac{i}{m}\right)\right),
\end{align}
where $R^\theta_{(i)}$ is the $i$th smallest order statistic of the samples $\{R^\theta_1,\ldots,R^\theta_m\}$. The reader is referred to Lemma~13 in Appendix~B for the proof. If we choose the distortion function as the identity function, then the estimator in \eqref{eq:hat_rho_G1} is merely the sample mean.
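As a small illustration of \eqref{eq:hat_rho_G1}, consider a hypothetical batch of $m=2$ episodes and the dual-power function from Table \ref{tb:g} with $r=2$, i.e., $g(s)=1-(1-s)^2$, for which $g(0)=0$, $g(\nicefrac{1}{2})=\nicefrac{3}{4}$, and $g(1)=1$. The two weights in \eqref{eq:hat_rho_G1} are $g(1)-g(\nicefrac{1}{2})=\nicefrac{1}{4}$ and $g(\nicefrac{1}{2})-g(0)=\nicefrac{3}{4}$, so that
\begin{align*}
    \hat{\rho}_g^G(\theta) = \frac{1}{4} R^\theta_{(1)} + \frac{3}{4} R^\theta_{(2)}.
\end{align*}
With the identity distortion, both weights equal $\nicefrac{1}{2}$ and the sample mean is recovered. In this example, the dual-power distortion places more weight on the larger of the two cumulative rewards.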

We make the following assumptions to ensure the Lipschitzness, and smoothness of the DRM $\rho_g$.
\begin{assumption}
    \label{as:g'_bound}
    $\exists M_{g'},M_{g''}>0: \forall t\in(0,1)$, $\lvert g'(t)\rvert \leq M_{g'}$, and $ \lvert g''(t) \rvert \leq M_{g''}$.
\end{assumption}
The assumption \ref{as:g'_bound} helps us establish that the distortion function and its derivative are Lipschitz continuous.
A few examples of distortion functions that satisfy \ref{as:g'_bound} are given in Table \ref{tb:g}.
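For instance, the constants in \ref{as:g'_bound} can be read off directly for two of the entries in Table \ref{tb:g}. For the dual-power function with $r\geq 2$ and the quadratic function with $0<r\leq 1$, respectively, we have for $t\in(0,1)$
\begin{align*}
    g'(t) &= r(1-t)^{r-1}, & g''(t) &= -r(r-1)(1-t)^{r-2},\\
    g'(t) &= (1+r)-2rt, & g''(t) &= -2r.
\end{align*}
Hence $M_{g'}=r$ and $M_{g''}=r(r-1)$ suffice for the dual-power function, while $M_{g'}=1+r$ and $M_{g''}=2r$ suffice for the quadratic function.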
%Since $g(\cdot)$ is bounded by definition, we can see that any $g(\cdot)$ whose second derivative is bounded, will have a bounded first derivative also.

A critical requirement for establishing a convergence guarantee is a bound on the MSE of the risk estimation scheme, as given in \ref{as:mse}. The result below shows that this MSE requirement is met by the DRM estimator \eqref{eq:hat_rho_G}. The proof can be found in Appendix~B.

\begin{lemma}
    \label{lm:est_error_G}
    Assume \ref{as:proper}-\ref{as:nabla_logpi} and \ref{as:g'_bound}. Then,
    \begin{align*}
        \E\left[\left\lvert \rho_g(\theta)- \hat{\rho}_g^G(\theta)\right\rvert^2\right]\leq\frac{16M_r^2M_{g'}^2}{m}.
    \end{align*}
\end{lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Off-policy DRM estimation}
\label{subsubsec:drm_offpolicy}
We generate $m$ episodes using the policy $b$ to estimate the CDF $F_{R^{\theta}}(\cdot)$ using importance sampling.
We denote by $R^b_i$ the cumulative reward, and $\psi^\theta_i$ the importance sampling ratio of the episode $i$.
We form the estimate $H^m_{R^{\theta}}(\cdot)$ of $F_{R^{\theta}}(\cdot)$ as follows:
\begin{align}
    H^m_{R^{\theta}}(x) &= \min\{\hat{H}^m_{R^{\theta}}(x),1\}, \textrm{ where} \label{eq:H}\\
    \hat{H}^m_{R^{\theta}}(x) &= \frac{1}{m}\sum\limits_{i=1}^m\1\{R^{b}_i\leq x\}\psi^\theta_i.\label{eq:hatH}
\end{align}
In the above, $\hat{H}^m_{R^{\theta}}(x)$ is an empirical estimate of $F_{R^{\theta}}(x)$, since $F_{R^{\theta}}(x) = \E\left[\1\{R^b\leq x\}\psi^\theta  \right]$. Because of the importance sampling ratio, $\hat{H}^m_{R^{\theta}}(x)$ can take a value above $1$. Since we are estimating a CDF, we clip $\hat{H}^m_{R^{\theta}}(x)$ at $1$ to obtain $H^m_{R^{\theta}}(x)$.
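To see why the clipping in \eqref{eq:H} matters, consider a hypothetical batch of $m=2$ episodes with importance sampling ratios $\psi^\theta_1=\psi^\theta_2=\nicefrac{3}{2}$ (illustrative values only). For any $x\geq \max\{R^b_1,R^b_2\}$, both indicators in \eqref{eq:hatH} equal one, and
\begin{align*}
    \hat{H}^m_{R^{\theta}}(x) = \frac{1}{2}\left(\frac{3}{2}+\frac{3}{2}\right) = \frac{3}{2} > 1,
    \qquad
    H^m_{R^{\theta}}(x) = \min\left\{\frac{3}{2},1\right\} = 1.
\end{align*}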

Now we form an estimate $\hat{\rho}_g^H(\theta)$ of $\rho_g(\theta)$ as
\begin{align}
    \label{eq:hat_rho_H}
    \hat{\rho}_g^H(\theta)\!=\!\!\!\int\nolimits_{-M_r}^{0}\!\!\!\!\!(g(1\!-\!H^m_{R^{\theta}}(x))\!-\!1) dx + \!\!\!\int\nolimits_{0}^{M_r}\!\!\!\!\! g(1\!-\!H^m_{R^{\theta}}(x))dx.
\end{align}
Similar to $\rho_g(\cdot)$, we can infer the following bound on $ \hat{\rho}_g^H(\cdot) $:
\begin{align}
    \label{eq:hat_rho_H_bound}
    \left\lvert \hat{\rho}_g^H(\cdot) \right\rvert\leq 2M_r.
\end{align}
The inequality in \eqref{eq:hat_rho_H_bound} along with \eqref{eq:rho_g_bound} satisfies the conditions specified by \ref{as:func_bound} for the DRM in an off-policy RL setting.

We can simplify \eqref{eq:hat_rho_H} in terms of order statistics as
\begin{align}
    \label{eq:hat_rho_H1}
    \hat{\rho}_g^H(\theta)&=R^b_{(1)}+ \sum_{i=2}^{m} {R^b_{(i)}} g\left(1\!-\! \min\left\{1,\frac{1}{m}\sum_{k=1}^{i-1}\psi^\theta_{(k)}\right\}\right)\nonumber\\
    & - \sum_{i=1}^{m-1}{R^b_{(i)}} g\left(1\!-\! \min\left\{1,\frac{1}{m}\sum_{k=1}^{i}\psi^\theta_{(k)}\right\}\right),
\end{align}
where $R^b_{(i)}$ is the $i$th smallest order statistic of the samples $\{R^b_1,\ldots,R^b_m\}$, and $\psi^\theta_{(i)}$ is the importance sampling ratio of $R^b_{(i)}$. The reader is referred to Lemma~14 in Appendix~B for the proof.
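As a consistency check, suppose that the behavior policy coincides with $\pi_\theta$, so that $\psi^\theta_i=1$ for all $i$ (a special case considered here only for illustration). Then $\frac{1}{m}\sum_{k=1}^{i}\psi^\theta_{(k)}=\nicefrac{i}{m}\leq 1$, and \eqref{eq:hat_rho_H1} becomes
\begin{align*}
    \hat{\rho}_g^H(\theta) &= R^b_{(1)} + \sum_{i=2}^{m} R^b_{(i)}\, g\left(1-\frac{i-1}{m}\right) - \sum_{i=1}^{m-1} R^b_{(i)}\, g\left(1-\frac{i}{m}\right)\\
    &= \sum_{i=1}^{m} R^b_{(i)} \left(g\left(1-\frac{i-1}{m}\right) - g\left(1-\frac{i}{m}\right)\right),
\end{align*}
where the second equality uses $g(1)=1$ and $g(0)=0$. This coincides with the on-policy expression \eqref{eq:hat_rho_G1}.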

A result in the spirit of Lemma \ref{lm:est_error_G} for the off-policy setting is given below. The result below shows that the MSE requirement in \ref{as:mse} is met by the DRM estimator \eqref{eq:hat_rho_H}. The proof can be found in Appendix~B.

\begin{lemma}
    \label{lm:est_error_H}
    Assume \ref{as:proper}-\ref{as:b_pol} and \ref{as:g'_bound}. Then,
    \begin{align*}
        \E\left[\left\lvert \rho_g(\theta)- \hat{\rho}_g^H(\theta)\right\rvert^2\right]\leq\frac{16M_r^2M_{g'}^2M_s^2}{m}.
    \end{align*}
\end{lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Convergence analysis}
\label{subsubsec:drm_conv}
First we show that the assumptions \ref{as:lip}-\ref{as:smooth} are satisfied for the DRM using the results from the following lemma (see Appendix~B for the proof).
\begin{lemma}
    \label{lm:rho_lip}
    $\forall \theta_1,\theta_2 \in \mathbb{R}^d$,
    \begin{align*}
        &\left\lvert \rho_g(\theta_1)- \rho_g(\theta_2)\right\rvert
        \leq L_\rho \left\lVert \theta_1 \!-\! \theta_2 \right\rVert; L_\rho=2M_rM_{g'}M_eM_d,\\
        & \left\lVert \nabla\rho_g(\theta_1) - \nabla \rho_g(\theta_2) \right\rVert \leq
        L_{\rho'} \left\lVert  \theta_1  - \theta_2\right\rVert;\\
        &\qquad  L_{\rho'}=2M_r M_e \left( M_h M_{g'}+M_eM_d^2 (M_{g'}+ M_{g''})\right).
    \end{align*}
\end{lemma}
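For a concrete instance of these constants, take the dual-power distortion $g(s)=1-(1-s)^2$ from Table \ref{tb:g}, for which $\lvert g'(t)\rvert=2(1-t)\leq 2$ and $\lvert g''(t)\rvert=2$ on $(0,1)$, so that $M_{g'}=M_{g''}=2$ is a valid choice in \ref{as:g'_bound}. Lemma \ref{lm:rho_lip} then gives
\begin{align*}
    L_\rho &= 4M_rM_eM_d,\\
    L_{\rho'} &= 2M_rM_e\left(2M_h+4M_eM_d^2\right)\\
    &= 4M_rM_e\left(M_h+2M_eM_d^2\right).
\end{align*}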
The main result that establishes a non-asymptotic bound for Algorithm \ref{alg:onP} with DRM as the risk measure is given below.
\begin{corollary}(DRM-OnP-SF)
    \label{cr:drm_onP}
    Assume \ref{as:proper}-\ref{as:nabla_logpi} and \ref{as:g'_bound}. Then, under the conditions of Theorem \ref{tm:onP_sf}, we have
    \begin {align*}
    &\mathbb{E}\left[\left\lVert \nabla \rho_g(\theta_R)\right\rVert^2\right]
    \leq \frac{2 \left(\rho_g^* - \rho_g(\theta_{0})\right)}{\sqrt{N}} + \frac{2d^2 L_{\rho}^2L_{\rho'}}{N\sqrt{N}}\\
    &\quad +  \frac{4d^2L_\rho^2+d^2C_1L_{\rho'}}{N}
    + \frac{d^2C_1+ d^2 L_{\rho'}^2}{\sqrt{N}}.
    \end {align*}
    In the above, $\rho_g^*\!=\!\max_{\theta\in\mathbb{R}^d}\rho_g(\theta)$. The constants $C_1, L_{\rho}$, and $L_{\rho'}$ are as in Lemmas \ref{lm:est_error_G} and \ref{lm:rho_lip}, respectively.
\end{corollary}
\begin{proof}
    Lemma \ref{lm:est_error_G} implies that \ref{as:mse} holds for the DRM estimator.
    Lemma \ref{lm:rho_lip} implies that the conditions in \ref{as:lip} and \ref{as:smooth} hold for the DRM. From \eqref{eq:rho_g_bound} and \eqref{eq:hat_rho_G_bound}, we can see that
    the conditions in \ref{as:func_bound} are satisfied for the DRM.
    The main claim now follows by an application of Theorem \ref{tm:onP_sf}.
\end{proof}
For the off-policy case, a non-asymptotic bound can be inferred from Theorem \ref{tm:offP_sf} in a similar fashion as the on-policy case, with
Lemma \ref{lm:est_error_H} in place of Lemma \ref{lm:est_error_G}, and \eqref{eq:hat_rho_H_bound} in place of \eqref{eq:hat_rho_G_bound}.
\begin{corollary}(DRM-OffP-SF)
    \label{cr:drm_offP}
    Assume \ref{as:proper}-\ref{as:g'_bound}. Then, under the conditions of Theorem \ref{tm:offP_sf}, we have
    \begin {align*}
    &\mathbb{E}\left[\left\lVert \nabla \rho_g(\theta_R)\right\rVert^2\right]
    \leq \frac{2 \left(\rho_g^* - \rho_g(\theta_{0})\right)}{\sqrt{N}}
    + \frac{2d^2 L_{\rho}^2L_{\rho'}}{N\sqrt{N}}\\
    &\quad+  \frac{C_24d^2L_\rho^2+d^2C_1L_{\rho'}}{C_2N}+ \frac{d^2C_1+C_2d^2 L_{\rho'}^2}{C_2\sqrt{N}}.
    \end {align*}
    In the above, $\rho_g^*,L_{\rho}$, and $L_{\rho'}$ are as in Corollary \ref{cr:drm_onP}. The constant $C_1$ is as in Lemma \ref{lm:est_error_H}.
\end{corollary}
\begin{remark}
    If we choose the distortion function as the identity function, then the estimator in \eqref{eq:hat_rho_G1} is merely the sample mean, and we recover the guarantees for a risk-neutral policy gradient algorithm. In particular, our bounds match the guarantees given by \cite{nv1}, which employs an SF-based gradient estimation scheme in a risk-neutral setting, and establishes consistency with the bounds of the REINFORCE-style policy gradient algorithm.
\end{remark}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Mean-variance risk measure (MVRM)}
\label{subsec:mvrm}
The MVRM $\rho_{\lambda}(\theta)$ of $R^\theta$, defined in \eqref{eq:Rtheta},  is given by
\begin{align}
    \label{eq:rho_lambda}
    &\rho_\lambda(\theta)=J(\theta)-\lambda V(\theta), \textrm{ where} \nonumber\\
    &J(\theta)=\E\left[R^\theta\right];\;V(\theta)=\E\left[\left(R^\theta\right)^2\right]-J(\theta)^2.
\end{align}
In the above, $J(\theta)$ is the value function, which is the objective in a risk-neutral RL setting. Further, $V(\theta)$ is the variance of the cumulative reward, and $\lambda$ is a scalar that is used to trade off between mean and variance. A popular risk measure in the control literature is exponential utility, where the objective is $-\frac{1}{\lambda}\log\mathbb{E}[e^{-\lambda R^\theta}]$, with $R^\theta$ denoting the cumulative reward.
Using a first-order Taylor expansion, it is apparent that
\[-\frac{1}{\lambda}\log\mathbb{E}[e^{-\lambda R^\theta}] = \mathbb{E}[R^\theta] - \frac{\lambda}{2}\text{Var}[R^\theta]+O(\lambda^2).\]
Thus, the MVRM risk measure defined above can be seen as an approximation to the exponential utility risk measure. Optimizing the latter risk measure in an RL context is challenging, and to the best of our knowledge, there is no RL algorithm with a compact parameterization for this problem. Instead of using a parameterized family of policies, the authors of \cite{prashla2021} adopt a different approach by treating the policy as a probability vector over all states and actions. Further, they introduce a two-timescale tabular algorithm using Q-values within the context of an average-cost MDP setting (see Section 7.1 of \cite{prashla2021} for the details). In contrast, we present a policy gradient algorithm for MVRM with a provable bound on the rate of convergence to a stationary point.
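For completeness, the expansion of the exponential utility stated above can be sketched as follows, using that $R^\theta$ is bounded so that all moments exist. Write $\kappa(\lambda)=\log\mathbb{E}[e^{-\lambda R^\theta}]$, a notation introduced here only for this calculation. Then $\kappa(0)=0$, $\kappa'(0)=-\mathbb{E}[R^\theta]$, and $\kappa''(0)=\text{Var}[R^\theta]$, so that
\begin{align*}
    \kappa(\lambda) &= -\lambda\,\mathbb{E}[R^\theta] + \frac{\lambda^2}{2}\text{Var}[R^\theta] + O(\lambda^3),\\
    -\frac{1}{\lambda}\kappa(\lambda) &= \mathbb{E}[R^\theta] - \frac{\lambda}{2}\text{Var}[R^\theta] + O(\lambda^2).
\end{align*}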

It is easy to see that $\left\lvert J(\theta)\right\rvert \leq M_r$ and $V(\theta) \leq M_r^2$. Hence, we can infer the following bound on $\rho_\lambda(\cdot)$:
\begin{align}
    \label{eq:rho_lambda_bound}
    \left\lvert\rho_\lambda(\cdot)\right\rvert\leq M_r(1+\lambda M_r).
\end{align}
The inequality in \eqref{eq:rho_lambda_bound} partially satisfies the conditions specified by \ref{as:func_bound} for the MVRM.

Next, we describe the estimation of the MVRM in on-policy and off-policy settings, respectively.
\subsubsection{On-policy MVRM estimation}
\label{subsubsec:mvrm_onpolicy}
We generate $m$ episodes using the policy $\pi_\theta$, and estimate $J(\theta)$ and  $V(\theta)$ using sample averages.
We denote by $R^{\theta}_i$ the cumulative reward of the episode $i$. The estimators $\hat{J}_m^{\pi}$ of $J(\theta)$ and $\widehat{V}_m^{\pi}$ of $V(\theta)$ are defined as follows:
\begin{align}
    \label{eq:Jm_pi}
    &\hat{J}_m^{\pi}(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m} R^\theta_i;\\
    \label{eq:Vm_pi}
    &\widehat{V}_m^{\pi}(\theta)=\frac{1}{m-1}\sum\limits_{i=1}^{m} \left(R^\theta_i-\hat{J}_m^{\pi}\right)^2.
\end{align}
Using Theorems 2--3 in \cite[Chapter VI]{mood74}, we can see that the above estimates are unbiased.

Using \eqref{eq:Jm_pi} and \eqref{eq:Vm_pi}, we estimate $\rho_\lambda(\theta)$ as follows:
\begin{align}
    \label{eq:hat_rho_lambda_pi}
    &\hat{\rho}_\lambda^{\pi}(\theta) = \hat{J}_m^{\pi}(\theta)-\lambda\widehat{V}_m^{\pi}(\theta).
\end{align}
We can see that $\left\lvert \hat{J}_m^{\pi}(\cdot)\right\rvert \leq M_r$ and $\widehat{V}_m^{\pi}(\cdot) \leq 8M_r^2$. Hence, we can infer the following bound on $\hat{\rho}_\lambda^{\pi}(\cdot)$:
\begin{align}
    \label{eq:hat_rho_lambda_pi_bound}
    \left\lvert\hat{\rho}_\lambda^{\pi}(\cdot)\right\rvert\leq M_r(1+8\lambda M_r).
\end{align}
The inequality in \eqref{eq:hat_rho_lambda_pi_bound} along with \eqref{eq:rho_lambda_bound} satisfies the conditions specified by \ref{as:func_bound} for the MVRM in an on-policy RL setting.
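The bound on $\widehat{V}_m^{\pi}$ used above can be verified in one line. Since $\left\lvert R^\theta_i\right\rvert\leq M_r$ and $\left\lvert \hat{J}_m^{\pi}\right\rvert\leq M_r$, each summand in \eqref{eq:Vm_pi} is at most $(2M_r)^2=4M_r^2$, and therefore, for $m\geq 2$,
\begin{align*}
    0\leq \widehat{V}_m^{\pi}(\theta) \leq \frac{m}{m-1}\,4M_r^2 \leq 8M_r^2,
\end{align*}
because $\nicefrac{m}{(m-1)}\leq 2$. Assuming $\lambda\geq 0$, the triangle inequality then gives $\left\lvert\hat{\rho}_\lambda^{\pi}(\theta)\right\rvert\leq M_r+8\lambda M_r^2$, which is \eqref{eq:hat_rho_lambda_pi_bound}.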

The result below shows that the mean-variance estimator \eqref{eq:hat_rho_lambda_pi} satisfies an MSE bound of order $O(1/m)$, which in turn verifies \ref{as:mse}. The proof can be found in Appendix~C.
\begin{lemma}
    \label{lm:est_error_mvrm_pi}
    Assume \ref{as:proper}-\ref{as:nabla_logpi}, and let $m>2$. Then
    \begin{align*}
        \E\left[\left\lvert \hat{\rho}_\lambda^{\pi}(\theta)- \rho_\lambda(\theta) \right\rvert^2\right]
        \leq \frac{8M_r^2+32\lambda^2M_r^4}{m}.
    \end{align*}
\end{lemma}
\subsubsection{Off-policy MVRM estimation}
\label{subsec:mvrm_offpolicy}
We generate $m$ episodes using the policy $b$ to estimate $J(\theta)$ using importance sampling.
We denote by $R^b_i$ the cumulative reward, and $\psi^\theta_i$ the importance sampling ratio of the episode $i$. Since $J(\theta)=\E\left[R^b\psi^\theta\right]$, we estimate it using a sample average as follows:
\begin{align}
    \label{eq:Jm_b}
    &\hat{J}_m^{b}(\theta)=\frac{1}{m}\sum\limits_{i=1}^{m} R^b_i \psi^\theta_i;\\
    %\end{align}
    %\begin{align}
    \label{eq:Vm_b}
    &\widehat{V}_m^{b}(\theta)=\frac{1}{m-1}\sum\limits_{i=1}^{m} \left(R^b_i \psi^\theta_i -\hat{J}_m^{b}\right)^2.
\end{align}
As in the on-policy setting, these estimates are unbiased.

Now using \eqref{eq:Jm_b} and \eqref{eq:Vm_b}, we estimate $\rho_\lambda(\theta)$ as follows:
\begin{align}
    \label{eq:hat_rho_lambda_b}
    &\hat{\rho}_\lambda^{b}(\theta) = \hat{J}_m^{b}(\theta)-\lambda\widehat{V}_m^{b}(\theta).
\end{align}
Similar to the on-policy case, we can see that $\left\lvert \hat{J}_m^{b}(\cdot)\right\rvert \leq M_rM_s$ and $\widehat{V}_m^{b}(\cdot) \leq 8M_r^2M_s^2$. Hence, we can infer the following bound on $\hat{\rho}_\lambda^{b}(\cdot)$:
\begin{align}
    \label{eq:hat_rho_lambda_b_bound}
    \left\lvert\hat{\rho}_\lambda^{b}(\cdot)\right\rvert\leq M_rM_s(1+8\lambda M_rM_s).
\end{align}
The inequality in \eqref{eq:hat_rho_lambda_b_bound} along with \eqref{eq:rho_lambda_bound} satisfies the conditions specified by \ref{as:func_bound} for the MVRM in an off-policy RL setting.

A result in the spirit of Lemma \ref{lm:est_error_mvrm_pi} for the off-policy setting is given below. It shows that the mean-variance estimator \eqref{eq:hat_rho_lambda_b} satisfies an MSE bound of order $O(1/m)$, which in turn verifies \ref{as:mse}. The proof can be found in Appendix~C.
\begin{lemma}
    \label{lm:est_error_mvrm_b}
    Assume \ref{as:proper}-\ref{as:b_pol}, and let $m>2$. Then
    \begin{align*}
        \E\left[\left\lvert \hat{\rho}_\lambda^{b}(\theta)- \rho_\lambda(\theta) \right\rvert^2\right]
        \leq \frac{8M_r^2M_s^2+32\lambda^2M_r^4M_s^4}{m}.
    \end{align*}
\end{lemma}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Convergence analysis}
\label{subsubsec:mvrm_conv}
We specialize the result in Proposition \ref{pr:non_asym_sf} to MVRM. Although MVRM has previously been analyzed in \cite{tamar2012policy,prashla13}, those works provide only asymptotic convergence results.
In the following lemma, we show that the assumptions \ref{as:lip}-\ref{as:smooth} are satisfied for the MVRM. The proof can be found in Appendix~C.
\begin{lemma}
    \label{lm:lip_rho_lambda}
    $\forall \theta_1,\theta_2 \in \R^d$,
    \begin{align*}
        & \left\lvert \rho_\lambda(\theta_1)-\rho_\lambda(\theta_2)\right\rvert
        \leq L_{\rho} \left\lVert \theta_1 - \theta_2 \right\rVert;\\
        &\qquad L_{\rho}=M_rM_eM_d+ 3\lambda M_r^2 M_e M_d, \textrm{ and}\\
        &\left\lVert \nabla \rho_\lambda(\theta_1)-\nabla \rho_\lambda(\theta_2)\right\rVert
        \leq  L_{\rho'} \left\lVert \theta_1 - \theta_2 \right\rVert;\\
        &\qquad L_{\rho'} = M_rM_e\left(M_h+M_eM_d^2\right)\\
        &\qquad\qquad +\lambda M_r^2M_e\left(3M_h+5 M_eM_d^2\right).
    \end{align*}
\end{lemma}
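As a quick reading of these constants, both factor into a risk-neutral part plus a term proportional to $\lambda$:
\begin{align*}
    L_{\rho} &= M_rM_eM_d\left(1+3\lambda M_r\right),\\
    L_{\rho'} &= M_rM_e\left(M_h+M_eM_d^2\right)\\
    &\quad + \lambda M_r\cdot M_rM_e\left(3M_h+5M_eM_d^2\right).
\end{align*}
In particular, setting $\lambda=0$ gives $\rho_\lambda=J$, and the lemma then yields $L_{\rho}=M_rM_eM_d$ and $L_{\rho'}=M_rM_e\left(M_h+M_eM_d^2\right)$ as Lipschitz and smoothness constants for the risk-neutral objective $J(\theta)$.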
\begin{corollary}(MVRM-OnP-SF)
    \label{cr:mvrm_onP}
    Assume \ref{as:proper}-\ref{as:nabla_logpi}. Let the batch size $m=\sqrt{N}>2$. Then, under the conditions of Theorem \ref{tm:onP_sf}, we have
    \begin {align*}
    &\mathbb{E}\left[\left\lVert \nabla \rho_\lambda(\theta_R)\right\rVert^2\right]
    \leq \frac{2 \left(\rho_\lambda^* - \rho_\lambda(\theta_{0})\right)}{\sqrt{N}} + \frac{2d^2 L_{\rho}^2L_{\rho'}}{N\sqrt{N}}\\
    &\quad +  \frac{4d^2L_\rho^2+d^2C_1L_{\rho'}}{N}
    + \frac{d^2C_1+ d^2 L_{\rho'}^2}{\sqrt{N}}.
    \end {align*}
    In the above, $\rho_{\lambda}^*\!=\!\max_{\theta\in\mathbb{R}^d}\rho_{\lambda}(\theta)$. The constants $C_1, L_{\rho}$, and $L_{\rho'}$ are as in Lemmas \ref{lm:est_error_mvrm_pi} and \ref{lm:lip_rho_lambda}, respectively.
\end{corollary}
\begin{proof}
    Lemma \ref{lm:est_error_mvrm_pi} implies that \ref{as:mse} holds for the MVRM estimator.
    Lemma \ref{lm:lip_rho_lambda} implies that the conditions in \ref{as:lip} and \ref{as:smooth} hold for the MVRM.
    From \eqref{eq:rho_lambda_bound} and \eqref{eq:hat_rho_lambda_pi_bound}, we can see that the conditions in \ref{as:func_bound} are satisfied for the MVRM.
    The main claim now follows by an application of Theorem \ref{tm:onP_sf}.
\end{proof}
The bound for the off-policy variant follows by using
Lemma \ref{lm:est_error_mvrm_b} in place of Lemma \ref{lm:est_error_mvrm_pi}, and \eqref{eq:hat_rho_lambda_b_bound} in place of \eqref{eq:hat_rho_lambda_pi_bound}.
\begin{corollary}(MVRM-OffP-SF)
    \label{cr:mvrm_offP}
    Assume \ref{as:proper}-\ref{as:b_pol}. Let $C_2>2$. Then, under the conditions of Theorem \ref{tm:offP_sf}, we have
    \begin {align*}
    &\mathbb{E}\left[\left\lVert \nabla \rho_\lambda(\theta_R)\right\rVert^2\right]
    \leq \frac{2 \left(\rho_{\lambda}^* - \rho_\lambda(\theta_{0})\right)}{\sqrt{N}}
    + \frac{2d^2 L_{\rho}^2L_{\rho'}}{N\sqrt{N}}\\
    &\quad+  \frac{C_24d^2L_\rho^2+d^2C_1L_{\rho'}}{C_2N}+ \frac{d^2C_1+C_2d^2 L_{\rho'}^2}{C_2\sqrt{N}}.
    \end {align*}
    In the above, $\rho_{\lambda}^*,L_{\rho}$, and $L_{\rho'}$ are as in Corollary \ref{cr:mvrm_onP}. The constant $C_1$ is as in Lemma \ref{lm:est_error_mvrm_b}.
\end{corollary}
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions}
\label{sec:conclusions}
We proposed two policy gradient algorithms that cater to the broad class of smooth risk measures. Both algorithms employ an SF-based gradient estimation scheme, and were shown to work in on-policy as well as off-policy RL settings. We derived non-asymptotic bounds that quantify the rate of convergence of our proposed algorithms to a stationary point of the smooth risk measure. As special cases, we showed that our theory and algorithms apply to the optimization of MVRM and DRM. To the best of our knowledge, policy gradient algorithms with non-asymptotic convergence guarantees are not available in the literature for smooth risk measures in general, and for the special cases of DRM and MVRM in particular.

As future work, it would be interesting to investigate the convergence properties of non-smooth risk measures such as CVaR and CPT. While CVaR can be expressed as a DRM, its distortion function is not smooth, and CPT has a similar distortion function that is also non-smooth. To develop policy gradient algorithms, one could explore the possibility of using smooth approximations of these distortion functions and analyze their convergence properties.
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{acknowledgements}
This work was supported in part by the Women Leading IITM 2023 grant.
\end{acknowledgements}
\clearpage\newpage
\bibliography{vijayan_677}
\end{document}
