\section{Method}
\label{sec:method}

\subsection{Dual training of quantiles and expectiles}
\label{sec:dual-training}

Expectiles have been suggested to be more efficient than quantiles for function approximation~\citep{expectilesoriginal, exp-quant-david-goliath}, but unlike quantiles, they cannot be used directly to generate proper samples of the estimated return distribution ($z_j$ in Eq.~\eqref{etd-loss}), which distributional dynamic programming requires. \citet{er-dqn} propose an \textit{imputation strategy}, i.e., a way to generate samples of a distribution that matches the current set of estimated expectiles, by solving a convex optimisation problem. In our experiments, applying this imputation strategy drastically increased the runtime (training was around 25 times slower in our setup), which makes experimentation with such methods impractical for researchers with modest computing resources. We instead propose to learn a functional mapping between expectiles and quantiles and to use the predicted quantiles to generate samples.

We learn a single $Z$-function using expectile regression, so that $\forall (s,a) \in \mathcal{S} \times \mathcal{A}, \tau \in [0,1], \; Z_\theta(s,a,\tau) \mathrel{\hat=} E_{Z(s,a)}(\tau)$, where $Z$ is the true $Z$-function we wish to estimate. For non-deterministic $Z(s,a)$, the expectile function at a given state-action pair $E_{Z(s,a)} \in \mathbb{R}^{[0,1]}$ is strictly increasing and continuous, and it spans the entire convex hull of the distribution's support~\citep{german-paper}. Meanwhile, the quantile function $Q_{Z(s,a)} = F^{-1}_{Z(s,a)} \in \mathbb{R}^{[0,1]}$ takes its values in the distribution's support, which is contained in this convex hull. As a consequence, every quantile value is attained by exactly one expectile fraction, i.e., there exists a functional mapping from quantile fractions to expectile fractions. We propose to learn such a mapper $m_\phi(s,a,\tau) \mathrel{\hat=} E^{-1}_{Z(s,a)} \circ F^{-1}_{Z(s,a)} (\tau)$
using the quantile regression loss function from Eq.~\eqref{qtd-loss}, which yields $\forall (s,a) \in \mathcal{S} \times \mathcal{A}, \tau \in [0,1], \; Z_\theta(s,a,m_\phi(s, a, \tau)) \mathrel{\hat=} Q_{Z(s,a)}(\tau)$. We can then simply query our estimator of quantiles at the next state-action pair to obtain a sound imputation step, while the parameters of the $Z$-function are learned through expectile regression.
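
As a simple illustration, the exact mapping admits a closed form in some cases. The following worked example assumes $Z(s,a) \sim \mathcal{U}(0,1)$, which is chosen for tractability and is not one of our experimental settings. The first-order optimality condition of the expectile loss reads $\tau \, \mathbb{E}[(Z - e)_+] = (1 - \tau) \, \mathbb{E}[(e - Z)_+]$. For the uniform distribution, $\mathbb{E}[(Z - e)_+] = (1-e)^2/2$ and $\mathbb{E}[(e - Z)_+] = e^2/2$, hence
$$E_Z(\tau) = \frac{\sqrt{\tau}}{\sqrt{\tau} + \sqrt{1 - \tau}}, \qquad E_Z^{-1}(e) = \frac{e^2}{e^2 + (1-e)^2}.$$
Since $F_Z^{-1}(\tau) = \tau$, the mapping from quantile fractions to expectile fractions is
$$E_Z^{-1} \circ F_Z^{-1}(\tau) = \frac{\tau^2}{\tau^2 + (1-\tau)^2},$$
so that, for instance, the $0.9$-quantile coincides with the expectile of fraction $0.81/0.82 \approx 0.988$.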

For any tuple $(s, a, s', a')$, our proposed update step can be described as follows:
\begin{enumerate}[nosep]
    \item Sample fractions $\hat{\tau} \sim \mathcal{U}(0,1)$.
    \item Generate approximate samples of the target distribution using the quantile representation: 
    $$\hat{z} = R(s,a) + \gamma Z_\theta(s',a',m_\phi(s', a', \hat{\tau})).$$
    \item Use expectile regression to learn the $Z$-function: 
    $$ \theta \leftarrow \arg\min_\theta \mathcal{L}_E\left(Z_\theta(s,a,\hat{\tau}), \hat{z} \right).$$
    \item Use quantile regression to learn the mapper:
    $$\phi \leftarrow \arg\min_\phi \mathcal{L}_Q\left(Z_\theta(s,a,m_\phi(s, a, \hat{\tau})), \hat{z} \right).$$
\end{enumerate}


\begin{algorithm*}[!ht]
    \caption{Implicit expectile-quantile networks (IEQN) update}
    \label{alg-ieqn}
    \begin{algorithmic}
        \Require $Z$-function $Z_\theta$, mapper $m_\phi$, fractions $(\tau_i)_{i = 1, \dots, N} \sim \mathcal{U}([0,1])$, learning rate $\lambda$.
        \State Collect experience $(s,a,r,s')$
        \State Compute the greedy next action: $$a' \leftarrow \arg\max_{b \in \mathcal{A}} \frac{1}{N} \sum_{i=1}^N Z_\theta(s', b, m_\phi(s', b, \tau_i))$$
        \For{$i = 1, \dots, N$}
            \State Compute expectile values $e_i \leftarrow Z_\theta(s,a, \tau_i)$ and quantile values $q_i \leftarrow Z_\theta(s,a, m_\phi(s, a, \tau_i))$
            \State Compute target samples: $$z_i \leftarrow r + \gamma \cdot \mathrm{stop\_grad}(Z_\theta(s', a', m_\phi(s', a', \tau_i)))$$
        \EndFor
        \State Compute the expectile loss: $$\mathcal{L}_E\leftarrow \frac{1}{N^2}\sum_{i=1}^N \sum_{j=1}^N \left(\tau_i \mathds{1}_{z_j > e_i} + (1 - \tau_i) \mathds{1}_{z_j \leq e_i} \right) \left( z_j - e_i \right)^2 $$
        \State Compute the quantile loss: $$\mathcal{L}_{Q} \leftarrow \frac{1}{N^2}\sum_{i=1}^N \sum_{j=1}^N \left(\tau_i \mathds{1}_{z_j > q_i} + (1 - \tau_i) \mathds{1}_{z_j \leq q_i} \right) \left| z_j - q_i \right| $$
        \State Update expectile function parameters: $\theta \leftarrow \theta - \lambda \nabla_\theta \mathcal{L}_E$
        \State Update mapper parameters: $\phi \leftarrow \phi - \lambda \nabla_\phi \mathcal{L}_Q$
    \end{algorithmic}
\end{algorithm*}
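
The two losses of Algorithm~\ref{alg-ieqn} can be sketched in plain Python as follows. This is a minimal illustration and not a reference implementation: the function names \texttt{expectile\_loss} and \texttt{quantile\_loss} are hypothetical, the inputs are plain lists of floats, and no gradients are computed.

```python
def _pairwise_loss(preds, targets, taus, penalty):
    """Average an asymmetrically weighted penalty over all (prediction, target) pairs."""
    total = 0.0
    for pred, tau in zip(preds, taus):
        for z in targets:
            # Weight tau when the target lies above the prediction, 1 - tau otherwise.
            weight = tau if z > pred else 1.0 - tau
            total += weight * penalty(z - pred)
    return total / (len(preds) * len(targets))


def expectile_loss(e, z, taus):
    """Asymmetric squared loss L_E between expectile values e_i and samples z_j."""
    return _pairwise_loss(e, z, taus, lambda u: u * u)


def quantile_loss(q, z, taus):
    """Asymmetric absolute loss L_Q between quantile values q_i and samples z_j."""
    return _pairwise_loss(q, z, taus, abs)


# Toy usage: one prediction at 0, one target sample, fraction tau = 0.25.
print(expectile_loss([0.0], [2.0], [0.25]))   # target above prediction: 0.25 * 2^2
print(quantile_loss([0.0], [-2.0], [0.25]))   # target below prediction: 0.75 * |-2|
```

Both losses share the same asymmetric weighting and differ only in the penalty applied to the residual $z_j - e_i$ or $z_j - q_i$, which is why a single helper suffices in this sketch.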

The state-action embeddings of the mapper are copied from those of the $Z$-function. This way, the parameters of the $Z$-function (in our experiments below this includes the large image embedding networks and the overall scale of the rewards) are learned using expectile regression, while only the residual shape difference between the quantile and expectile functions is learned by the mapper, using quantile regression. 

The update step described above can be formalized as a distributional operator, which we define in Section~\ref{convergence}. We prove that this operator converges to the distributional Bellman operator as the number of estimated quantile/expectile fractions tends to infinity. Then, in Section~\ref{ieqn}, we detail a practical implementation of dual expectile-quantile RL based on implicit quantile networks that we name IEQN.


\subsection{Convergence of the dual expectile-quantile operator}
\label{convergence}


In this section, we prove that our proposed update operator converges to the
distributional dynamic programming operator from Eq.~\eqref{eq:ddp} as the number of quantiles and expectiles kept in memory tends to infinity, i.e., that the error incurred by our dual expectile-quantile operator vanishes in this limit. This result relies on several properties of the expectile function, including its absolute continuity, which we establish in the following lemma:


\begin{restatable}{lemma}{absolutecontinuity}
\label{absolutecontinuity}
    Let $Z$ be a random variable taking values in $[a,b]$ with finite second moment and whose CDF admits finitely many discontinuities. Then, the expectile function $E_Z : \tau \mapsto \arg\min_{e} \mathbb{E}_{z \sim Z} [(\tau \mathds{1}_{z > e} + (1 - \tau) \mathds{1}_{z \leq e} ) ( z - e )^2 ]$ is absolutely continuous on $[0,1]$.
\end{restatable}
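
To build intuition for the lemma, the expectile function of an empirical distribution can be evaluated numerically. The sketch below is an illustration under our own assumptions (the helper name \texttt{expectile}, the bisection scheme and the tolerance are hypothetical choices, not part of the method): it solves the first-order condition $\tau \, \mathbb{E}[(Z - e)_+] = (1 - \tau) \, \mathbb{E}[(e - Z)_+]$ of the expectile loss by bisection. For a two-point distribution on $\{0, 1\}$ with equal masses, the quantile function jumps from $0$ to $1$, whereas the condition reduces to $\tau (1 - e) = (1 - \tau) e$, i.e., $E_Z(\tau) = \tau$, which varies continuously over the convex hull $[0,1]$.

```python
def expectile(sample, tau, tol=1e-10):
    """Return the tau-expectile of an empirical sample by bisection."""
    lo, hi = min(sample), max(sample)

    def gap(e):
        # Proportional to tau * E[(Z - e)_+] - (1 - tau) * E[(e - Z)_+];
        # this quantity is non-increasing in e and vanishes at the expectile.
        above = sum(max(z - e, 0.0) for z in sample)
        below = sum(max(e - z, 0.0) for z in sample)
        return tau * above - (1.0 - tau) * below

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid  # the expectile lies above mid
        else:
            hi = mid  # the expectile lies at or below mid
    return 0.5 * (lo + hi)


# Two-point distribution on {0, 1}: the expectile function is the identity.
two_point = [0.0, 1.0]
for tau in (0.1, 0.5, 0.9):
    print(tau, round(expectile(two_point, tau), 6))
```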

\noindent%
The proofs for this lemma and all results below are included in the appendix. We are now able to prove our main result, Theorem~\ref{lemma}, i.e., that our dual regression projection operator approximates the target distribution well in the limit of an infinite number of quantile/expectile fractions:

\begin{restatable}{theorem}{wassersteinbound}
\label{lemma}
Let $\tau_k = \frac{2k -1}{2K}$, for $k = 1, \dots, K$, and let $\Pi_{\mathcal{M}}^K : \mathscr{P}(\mathbb{R}) \rightarrow \mathscr{P}(\mathbb{R})$ be the dual regression projection operator defined as:

 $\forall \eta \in \mathscr{P}(\mathbb{R}),$
\begin{align}
 \Pi_{\mathcal{M}}^K(\eta) &= \frac{1}{K} \sum_{k=1}^{K} \delta_{E_\eta \left(\mathrm{floor}^K \left( E^{-1}_\eta \left( F^{-1}_\eta(\tau_k) \right) \right) \right)}  \\
 &= \frac{1}{K} \sum_{k=1}^{K} \delta_{E_\eta \left(\frac{2\left \lfloor K \mathcal{M}(\tau_k)  + 1 / 2 \right \rfloor - 1}{2K} \right)}, 
\end{align}

\noindent%
where $E_\eta : [0,1] \rightarrow \mathbb{R}$ is the expectile function of $\eta$, $F^{-1}_\eta : [0,1] \rightarrow \mathbb{R}$ is the inverse CDF -- i.e., the quantile function -- of $\eta$, $\mathcal{M} = E^{-1}_\eta \circ F^{-1}_\eta$ is the mapping from quantile fractions to expectile fractions, and $\mathrm{floor}^K(x) = \tau_{\left \lfloor Kx  + \frac{1}{2} \right \rfloor}$. Let $\eta \in \mathscr{P}(\mathbb{R})$ be a bounded-support probability distribution with finite second moment and whose CDF admits finitely many discontinuities, and let $W_1$ be the $1$-Wasserstein distance. Then:
$$\lim_{K \to \infty} W_1(\Pi_{\mathcal{M}}^K\eta, \eta) = 0 \; .$$
\end{restatable}
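
The statement can be checked numerically in a case where every quantity is available in closed form. The following sketch is an illustration under our own assumptions and is not part of the proof: we take $\eta = \mathcal{U}(0,1)$, for which the first-order condition of the expectile loss gives $E_\eta(\tau) = \sqrt{\tau} / (\sqrt{\tau} + \sqrt{1 - \tau})$ and $E^{-1}_\eta \circ F^{-1}_\eta(\tau) = \tau^2 / (\tau^2 + (1-\tau)^2)$. All function names are hypothetical, and we clip the index $\lfloor Kx + 1/2 \rfloor$ to $\{1, \dots, K\}$ so that $\tau_0$ is never queried.

```python
import math


def expectile_uniform(tau):
    """Closed-form expectile function of U(0, 1)."""
    s, t = math.sqrt(tau), math.sqrt(1.0 - tau)
    return s / (s + t)


def mapper_uniform(tau):
    """Closed-form map from quantile fractions to expectile fractions for U(0, 1)."""
    return tau**2 / (tau**2 + (1.0 - tau) ** 2)


def projected_atoms(K):
    """Atoms of the dual regression projection of U(0, 1) with K fractions."""
    atoms = []
    for k in range(1, K + 1):
        tau_k = (2 * k - 1) / (2 * K)
        j = math.floor(K * mapper_uniform(tau_k) + 0.5)
        j = min(max(j, 1), K)  # assumption: clip the index to {1, ..., K}
        atoms.append(expectile_uniform((2 * j - 1) / (2 * K)))
    return atoms


def w1_to_uniform(atoms):
    """Exact W_1 distance between equally weighted atoms and U(0, 1)."""
    K = len(atoms)
    total = 0.0
    for k, x in enumerate(sorted(atoms)):
        a, b = k / K, (k + 1) / K  # quantile levels carried by this atom
        # Integrate |x - u| for u in [a, b], where u is the uniform quantile function.
        if x <= a:
            total += (b - a) * (0.5 * (a + b) - x)
        elif x >= b:
            total += (b - a) * (x - 0.5 * (a + b))
        else:
            total += 0.5 * ((x - a) ** 2 + (b - x) ** 2)
    return total


for K in (10, 100, 1000):
    print(K, w1_to_uniform(projected_atoms(K)))
```

In this toy setting, the printed distances shrink as $K$ grows, in line with the limit stated in the theorem.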

\noindent%
Reusing the notation from the theorem, we can formally define our dual expectile-quantile operator. Let $\pi \in \mathscr{P}(\mathcal{A})^\mathcal{S}$ be a policy. We define:
\begin{equation}
    \mathcal{T}_{\mathcal{M}^K}^\pi = \Pi_\mathcal{M}^K \mathcal{T}^\pi \;,
\end{equation}
where $\mathcal{T}^\pi : Z(s_t,a_t) = R(s_t,a_t) + \gamma \mathbb{E}_\pi\left[Z(s_{t+1}^\pi,a_{t+1}^\pi)\right]$ is the distributional Bellman operator (see Section~\ref{sec:drl}).
We can now derive a key corollary in the context of distributional RL training:

\begin{restatable}{corollary}{convergence}
\label{corollary}
    On Markov decision processes with bounded rewards and $\gamma < 1$, the dual expectile-quantile operator converges pointwise to the distributional Bellman operator:
    $$ \lim_{K \to \infty} \mathcal{T}_{\mathcal{M}^K}^\pi = \mathcal{T}^\pi \; \mathrm{pointwise}.$$
\end{restatable}

\noindent%
This result contrasts with the failure of the naive expectile operator~\citep{er-dqn} to match the distributional Bellman operator. We now present a practical implementation of an agent using our dual approach.


\begin{figure*}[!ht]
\tabskip=0pt
\halign{#\cr
  \hbox{%
    \begin{subfigure}[b]{\textwidth}
    \centering
    \includegraphics[height=6.5cm, width=\textwidth]{figs/separate_vs_dual_regression.pdf}
    \caption{Approximating a distribution with separate and dual training.}
    \label{fig:separate_vs_dual_regression}
    \end{subfigure}%
  }\cr
  \hbox{%
    \begin{subfigure}{\textwidth}
    \centering
    \includegraphics[height=6.5cm, width=\textwidth]{figs/separate_vs_dual_bellman.pdf}
    \caption{Tabular distributional RL with separate and dual training.}
    \label{fig:separate_vs_dual_bellman}
    \end{subfigure}%
  }\cr
}
\caption{\textbf{(a)} Approximating a bimodal distribution with quantile and expectile regression. Quantile regression approximates the inverse CDF, albeit with high variance, especially on extreme values (left, blue curves). Expectiles converge very quickly to the expectile function (left, red curves). When training a mapper to generate quantiles from expectiles, quantile estimation becomes much more efficient (right). \textbf{(b)} Distributional RL with function approximation in a chain MDP with 4 states, and a bimodal reward distribution at the last state. The expectile function collapses as the temporal difference error propagates to previous states (left, red curves) while the quantile function is a poor approximation of the inverse CDF (left, blue curves). Our dual method solves both problems (right).}
\label{fig:separate_vs_dual}
\end{figure*}

\subsection{A practical implementation: IEQN}
\label{ieqn}
We use the principle described in Section~\ref{sec:dual-training} to implement IEQN (Algorithm~\ref{alg-ieqn}), a new distributional RL agent based on implicit quantile networks (IQN)~\citep{iqn}. The $Z$-function is modeled as a neural network that takes a state and a fraction $\tau \sim \mathcal{U}(0,1)$ as input and outputs the $\tau$-expectile values for all actions. Its parameters are learned via an asymmetric $L_2$ loss, i.e., expectile regression. The mapper between quantile fractions and expectile fractions is also implemented as a neural network, whose parameters are learned via an asymmetric $L_1$ loss, i.e., quantile regression.

