\section{Introduction}
\label{sec:intro}
Distributional reinforcement learning (RL)~\citep{distributional-book} aims to maintain an estimate of the full distribution of returns rather than only its mean, the expected return. Compared to a mean-based approach, it can better capture the uncertainty in the transition matrix of the environment~\citep{c51}, as well as the stochasticity of the policy being evaluated, which may enable faster and more stable training by making better use of the data samples~\citep{mavrin2019distributional}.

Non-parametric approximations of the return distribution learned by quantile regression have proven very effective in several domains~\citep{iqn, qr-dqn, fqf} when combined with deep RL agents such as deep Q-networks (DQN)~\citep{dqn} or soft actor-critic (SAC)~\citep{sac}. Their major advantage is that they provide guarantees for the convergence of distributional policy estimation~\citep{qr-dqn} and, in certain cases, of convergence to the optimal policy~\citep{rowland2023analysis}, all while requiring few assumptions on the shape of the return distribution and demonstrating strong empirical performance~\citep{iqn, fqf}. However, the best-performing quantile-based agents are often obtained by replacing the original quantile regression loss function, i.e., an asymmetric $L_1$ loss, with an asymmetric Huber loss, i.e., a hybrid $L_1$-$L_2$ loss. In doing so, the distributional guarantees vanish, as the proofs proposed in previous work rely on $L_1$-based quantile regression~\citep{distributional-book, qr-dqn}. Critically, we show in Section~\ref{sec:atari} that, in practice, the distributions estimated with the Huber loss collapse to their mean. In this paper, \textit{we propose a different approach, based on both quantile and expectile regression, that matches the performance of Huber-based agents while preserving distributional estimation guarantees and avoiding distributional collapse in practice.}

We are not the first to note that asymmetric $L_2$ losses, i.e., losses that regress \textit{expectiles} of the target distribution, tend to yield degenerate estimated distributions when agents are trained with temporal difference learning. \citet{er-dqn} note that expectiles of a distribution cannot be interpreted as samples from this distribution, and therefore that expectiles other than the mean cannot be used directly to compute the target values in distributional temporal difference learning. Instead, they propose to generate samples from the expectiles of the distribution by adding an imputation step that requires solving a costly root-finding problem. While this approach is theoretically justified, we found it to be extremely slow in practice, preventing widespread use at scale. In contrast, our dual approach tackles this problem through learning, and only requires an additional two-layer neural network and the computation of a quantile loss function on top of the expectile loss function. It therefore adds close to no computational overhead when training Atari agents on modern GPUs.
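
For reference, the losses discussed above can be sketched as follows. The notation is assumed here for illustration only and may differ from the cited works, in particular in its normalization: $u$ denotes the difference between a target sample and the current estimate, $\tau \in (0, 1)$ the quantile or expectile fraction, and $\kappa > 0$ the Huber threshold.
\[
\begin{array}{rcl}
\ell^{\mathrm{QR}}_{\tau}(u) & = & \left|\tau - \mathbf{1}\{u < 0\}\right| \, |u| \quad \mbox{(asymmetric $L_1$, quantile regression)}, \\[4pt]
\ell^{\mathrm{ER}}_{\tau}(u) & = & \left|\tau - \mathbf{1}\{u < 0\}\right| \, u^2 \quad \mbox{(asymmetric $L_2$, expectile regression)}, \\[4pt]
\ell^{\kappa}_{\tau}(u) & = & \left|\tau - \mathbf{1}\{u < 0\}\right| \, L_{\kappa}(u) \quad \mbox{(asymmetric Huber)},
\end{array}
\qquad
L_{\kappa}(u) = \left\{
\begin{array}{ll}
\frac{1}{2} u^2 & \mbox{if } |u| \leq \kappa, \\[2pt]
\kappa \left(|u| - \frac{1}{2} \kappa\right) & \mbox{otherwise}.
\end{array}
\right.
\]
For $|u| \leq \kappa$, the Huber variant coincides with the expectile loss up to a factor of $\frac{1}{2}$, so its minimizers are in general no longer quantiles of the target distribution.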

Our contributions can be summarized as follows:
\begin{itemize}[nosep]
    \item We propose a novel dual expectile-quantile approach to distributional dynamic programming that provably converges to the true value distribution as the number of estimated quantile and expectile fractions tends to infinity.
    \item We release implicit expectile-quantile networks (IEQN),\footnote{Available at \url{https://github.com/samijullien/ieqn}.} a practical implementation of our dual approach based on implicit quantile networks~\citep{iqn}.
    \item We show both on a toy example and at scale on the Atari-5 benchmark that IEQN \begin{enumerate*}[label=(\roman*)]
        \item avoids distributional collapse, and
        \item matches the performance of the Huber-based IQN-1 approach.
    \end{enumerate*}
\end{itemize}