\section{Background}
\label{sec:background}

\subsection{Distributional reinforcement learning}
\label{sec:drl}

We consider an environment modeled by a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, R, T, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, $R(s,a)$ denotes the stochastic reward obtained by taking action $a$ in state $s$, $T(\cdot \mid s,a)$ is the probability distribution over possible next states after taking $a$ in $s$, and $\gamma \in [0, 1)$ is a discount factor. Furthermore, we write $\pi(\cdot \mid s)$ for a (potentially stochastic) policy that selects an action given the current state.

We consider the problem of finding a policy maximizing the expected discounted return, i.e.,
\begin{equation}
    \pi^* = \arg\max_\pi \mathbb{E} \left[\sum_{t=0}^\infty \gamma^t R(s_t, a_t) \right],
\end{equation}
where $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim T(\cdot \mid s_t, a_t)$. We can define the action-value random variable for policy $\pi$ as $Z^\pi : (s,a) \mapsto \sum_{t=0}^\infty \gamma^t R(s_t, a_t)$, with $s_0=s, a_0=a$. We will refer to action-value variables and their estimators as $Z$-functions in the remainder. Note that the $Q$-function, as usually defined in RL~\citep{sutton2018reinforcement}, is given by $Q^\pi(s,a) = \mathbb{E}\left[ Z^\pi(s,a) \right]$. In this work, we consider approaches that evaluate policies through distributional dynamic programming, i.e., by repeatedly applying the distributional Bellman operator $\mathcal{T}^\pi$ to a candidate $Z$-function:
\begin{equation}
    \mathcal{T}^\pi Z(s_t,a_t) = R(s_t,a_t) + \gamma \mathbb{E}_\pi\left[Z(s_{t+1}^\pi,a_{t+1}^\pi)\right].
\label{eq:ddp}
\end{equation}
This operator has been shown to be a contraction in the $p$-Wasserstein distance and therefore admits a unique fixed point $Z^\pi$~\citep{c51}. A major challenge in distributional RL lies in the choice of representation for the action-value distribution, as well as in the empirical implementation of the distributional Bellman operator. For simplicity, and in line with previous work, we only consider empirical distributions~\citep[Definition~5.5]{distributional-book} (i.e., distributions whose representation fits in finite memory) in the remainder of the paper, and we also denote by $\mathcal{T}^\pi$ the distributional Bellman operator for the empirical representation~\citep[Algorithm~5.1]{distributional-book}.
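
% Illustrative sketch (not from the cited works): assumes a deterministic reward and transition.
As a minimal illustration (our own sketch, not taken from the cited works), assume a deterministic reward $r$, a deterministic transition to $s'$, and a fixed next action $a'$. If $Z(s',a')$ is represented by a uniform mixture of $N$ Diracs at hypothetical locations $z_1, \dots, z_N$, applying the operator of Eq.~\eqref{eq:ddp} shifts and scales each atom:
\[
  Z(s',a') \sim \frac{1}{N} \sum_{j=1}^N \delta_{z_j}
  \quad \Longrightarrow \quad
  \mathcal{T}^\pi Z(s,a) \sim \frac{1}{N} \sum_{j=1}^N \delta_{r + \gamma z_j}.
\]
For instance, with $r = 1$, $\gamma = 0.9$, and atoms $\{0, 10\}$, the backed-up atoms are $\{1, 10\}$. With stochastic transitions or policies, the result is instead a mixture of such shifted and scaled distributions, whose number of atoms generally exceeds $N$, which is why a projection onto the chosen representation is needed.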

\subsection{Quantile and expectile regression}
\label{section:regression}

Let $Z$ be a real-valued probability distribution. The $\alpha$-\emph{quantile} $q_\alpha$ of $Z$ is defined as a value splitting the probability mass of $Z$ into two parts of weights $\alpha$ and $1 - \alpha$, respectively:
\begin{equation}
    P(z \leq q_\alpha) = \alpha.
\label{quantile-def}
\end{equation}
Therefore, the \textit{quantile function} $Q_Z : \alpha \mapsto q_\alpha $ is the inverse cumulative distribution function: $Q_Z = F_Z^{-1}$. Alternatively, quantiles are given by the minimizer of an asymmetric $L_1$ loss:
\begin{equation}
\mbox{}\hspace*{-1mm}
    q_\alpha = \arg\min_{q} \mathbb{E}_{z \sim Z} \left[\left(\alpha \mathds{1}_{z > q} + (1 - \alpha) \mathds{1}_{z \leq q} \right)\left| z - q \right| \right].
\hspace*{-1mm}\mbox{}    
\label{quantile-reg}
\end{equation}
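% Illustrative sketch: assumes a continuous CDF F_Z and a finite mean.
To see why the minimizer of Eq.~\eqref{quantile-reg} satisfies Eq.~\eqref{quantile-def}, the following sketch assumes that $Z$ has a continuous CDF $F_Z$ and a finite mean (assumptions made here for illustration only). Writing $(x)_+ = \max(x, 0)$, the objective is $\ell(q) = \alpha \, \mathbb{E}[(z - q)_+] + (1 - \alpha) \, \mathbb{E}[(q - z)_+]$, and differentiating with respect to $q$ gives
\[
  \frac{\mathrm{d}\ell}{\mathrm{d}q}(q) = -\alpha \, P(z > q) + (1 - \alpha) \, P(z \leq q) = F_Z(q) - \alpha ,
\]
which vanishes exactly when $P(z \leq q) = \alpha$. Since $\ell$ is convex, this stationary point is a minimizer.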
Expectiles and the \textit{expectile function} $E_Z : \tau \mapsto e_\tau$ are defined analogously, as the $\tau$-expectile $e_\tau$ minimizes the asymmetric $L_2$ loss:
\begin{equation}
\label{expectile-reg}
\mbox{}\hspace*{-1mm}
    e_\tau \!=\! \arg\min_{e} \mathbb{E}_{z \sim Z} \!\left[\left(\tau \mathds{1}_{z > e} + (1 - \tau) \mathds{1}_{z \leq e} \right) \left( z - e \right)^2 \right]\!.
\hspace*{-1mm}\mbox{}    
\end{equation}
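% Illustrative sketch: assumes Z has a finite second moment.
A short sketch of the corresponding first-order condition, assuming that $Z$ has a finite second moment (an assumption added here for illustration): with $(x)_+ = \max(x, 0)$, setting the derivative of the objective in Eq.~\eqref{expectile-reg} with respect to $e$ to zero yields
\[
  \tau \, \mathbb{E}_{z \sim Z}\left[(z - e_\tau)_+\right] = (1 - \tau) \, \mathbb{E}_{z \sim Z}\left[(e_\tau - z)_+\right].
\]
In particular, for $\tau = 1/2$ this reduces to $\mathbb{E}_{z \sim Z}[z - e_{1/2}] = 0$, so the $1/2$-expectile is the mean of $Z$, just as the $1/2$-quantile is its median.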


\subsection{Quantiles and expectiles in distributional RL}

Quantile regression has been used for distributional RL in many previous studies~\citep[see, e.g.,][]{iqn, qr-dqn, fqf} where a parameterized quantile function $Q_Z^\theta(s,a,\alpha)$ is trained using a quantile temporal difference loss function derived from Eq.~\eqref{quantile-reg}, i.e., for $N$ estimated quantiles:
\begin{equation}
%\begin{split}
\label{qtd-loss}
%\mbox{}\hspace*{-2mm}
%&
\mathcal{L}_Q\left(Q_Z^\theta(s,a, \cdot), \mathbf{z} \right) \!=\! \sum_{i=1}^N \sum_{j=1}^N l_Q(q_i, z_j), 
%\textup{  with }
%\\
%   &l_Q(q_i, z_j) \!=\! \left(\alpha_i \mathds{1}_{z_j > q_i} + (1 - \alpha_i) \mathds{1}_{z_j \leq q_i} \right)\left| z_j - q_i \right|,
%\hspace*{-2mm}\mbox{}
%\end{split}
\end{equation}
with $l_Q(q_i, z_j) \!=\! (\alpha_i \mathds{1}_{z_j > q_i} + (1 - \alpha_i) \mathds{1}_{z_j \leq q_i} )| z_j - q_i |$,
%
where the trainable quantile values $q_i = Q_Z^\theta(s,a, \alpha_i)$ are obtained by querying the quantile function at various quantile fractions $\alpha_i$, which can be either fixed by the designer~\citep{qr-dqn}, sampled from a distribution~\citep{iqn}, or learned during training~\citep{fqf}. In quantile-based temporal difference (QTD) learning, the target samples $z_j$ can be obtained by querying the estimated quantile function at the next state-action pair: $z_j = r + \gamma Q_Z^\theta(s',a', \alpha_j)$.\footnote{We can have $a' \sim \pi(\cdot \mid s')$, as in actor-critic algorithms, or $a' = 
\arg\max_a Q_Z^\theta(s',a, \alpha_j)$ as in Q-learning. This section is agnostic to that choice, but we refer to \citet{distributional-book} for a convergence analysis in the latter case.} Indeed, because the true quantile function is the inverse CDF of the action-value distribution, \citet{qr-dqn} and \citet{distributional-book} showed that, among $N$-atom representations, quantiles at equidistant fractions minimize the $1$-Wasserstein distance to the action-value distribution and that the resulting projected Bellman operator is a contraction mapping in such a distance. \citet{rowland2023analysis} extended these results to prove the convergence of QTD learning under mild assumptions. We refer to these studies for a more detailed convergence analysis.
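
% Illustrative sketch: the target z_j is treated as a constant (semi-gradient assumption).
For intuition, the per-sample update implied by $l_Q$ can be sketched as follows, under the assumption (common in semi-gradient TD methods, and made here only for illustration) that the target $z_j$ is treated as a constant. For $q_i \neq z_j$,
\[
  \frac{\partial l_Q(q_i, z_j)}{\partial q_i} = \mathds{1}_{z_j \leq q_i} - \alpha_i ,
\]
so a gradient step raises $q_i$ by an amount proportional to $\alpha_i$ when $z_j > q_i$ and lowers it by an amount proportional to $1 - \alpha_i$ otherwise, regardless of the magnitude $|z_j - q_i|$. As an example of fixed fractions, the equidistant choice of \citet{qr-dqn} is $\alpha_i = (2i - 1) / (2N)$, e.g., $\alpha = (1/4, 3/4)$ for $N = 2$.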

In contrast, expectile-based temporal difference (ETD) learning cannot directly reuse the training scheme of Eq.~\eqref{qtd-loss}. We first write the generic ETD loss derived from Eq.~\eqref{expectile-reg}:
%
\begin{equation}
\label{etd-loss}
%\begin{split}
% \mbox{}\hspace*{-2mm}
%& 
\mathcal{L}_E\left(E_Z^\theta(s,a, \cdot), \mathbf{z} \right) \!=\! \sum_{i=1}^N \sum_{j=1}^N l_E(e_i, z_j), 
%\textup{ with}\\
%&l_E(e_i, z_j) \!=\! \left(\tau_i \mathds{1}_{z_j > e_i} + (1 - \tau_i) \mathds{1}_{z_j \leq e_i} \right)\left( z_j - e_i \right)^2,
% \hspace*{-2mm}\mbox{}
%\end{split}
\end{equation}
with 
$l_E(e_i, z_j) \!=\! (\tau_i \mathds{1}_{z_j > e_i} + (1 - \tau_i) \mathds{1}_{z_j \leq e_i} )( z_j - e_i )^2$,
%
and $e_i = E_Z^\theta(s,a, \tau_i)$. Here, choosing $z_j = r + \gamma E_Z^\theta(s',a', \tau_j)$, analogously to QTD learning and non-distributional TD learning, would cause the update to approximate a different distribution: the expectile function is in general not the inverse CDF of the return distribution, so expectiles cannot be treated as samples from that distribution. \citet{er-dqn} formalized this idea using the concept of \textit{Bellman-closedness}, i.e., the property that the projected Bellman operator yields the same statistics whether it is applied to the target distribution or to the implicit distribution given by the statistics of the target distribution (in our case, a uniform mixture of Diracs with locations given by quantiles or expectiles).
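
% Illustrative toy example (not from the cited works): a two-point return distribution.
A toy example (ours, for illustration only) makes the mismatch concrete. Let the return be $0$ or $1$ with probability $1/2$ each. For $e \in [0, 1]$, the objective of Eq.~\eqref{expectile-reg} equals $\frac{1}{2}\left[\tau (1 - e)^2 + (1 - \tau) e^2\right]$, whose derivative vanishes at
\[
  -\tau (1 - e_\tau) + (1 - \tau) e_\tau = 0
  \quad \Longleftrightarrow \quad
  e_\tau = \tau .
\]
With $N = 2$ and the hypothetical choice $\tau = (1/4, 3/4)$, the expectiles are $(1/4, 3/4)$, whereas the quantiles at the same fractions are $(0, 1)$. A uniform mixture of Diracs at the expectiles has the correct mean $1/2$ but variance $1/16$ instead of the true variance $1/4$, so treating expectiles as samples would bootstrap from a distribution that is too concentrated.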
