

\section{Experiments}
\label{sec:experiments}


We first demonstrate on a toy MDP the benefits of learning quantiles and expectiles together. We then describe our experimental setup and results on the Atari games of the Arcade Learning Environment (ALE).




\subsection{Chain MDP: A toy example}
\label{sec:chain-MDP}

We start by observing the effect of our proposed operator in a toy environment. The MDP comprises $4$ states, each leading to the next through a single action with zero reward, until the last state $s_4$, where the episode terminates and the agent obtains a reward sampled from a bimodal distribution $r \sim ( \frac{1}{2}\mathcal{N}(-2, 1) + \frac{1}{2}\mathcal{N}(+2, 1) )$ (see the appendix for a visual description).
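
For reference, the first two moments of this reward distribution follow directly from the mixture; the symbol $\varphi$ (the standard normal density) is introduced here for illustration only. With $p(r) = \frac{1}{2}\varphi(r+2) + \frac{1}{2}\varphi(r-2)$,
\[
\mathbb{E}[r] = \tfrac{1}{2}(-2) + \tfrac{1}{2}(+2) = 0,
\qquad
\mathrm{Var}[r] = \mathbb{E}[r^2] - \mathbb{E}[r]^2 = \tfrac{1}{2}(1 + 4) + \tfrac{1}{2}(1 + 4) - 0 = 5 .
\]
The mean therefore lies in the low-density region between the two modes, so an estimate that collapses to the mean discards the bimodal structure entirely.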

Figure~\ref{fig:separate_vs_dual_regression} highlights the advantageous properties of expectile regression that were introduced in prior work~\citep{expectile-blue, expbible, exp-quant-david-goliath}. When approximating the distribution of terminal rewards directly from samples (left), expectile regression yields more accurate estimates than quantile regression in the low-data regime (recall that the quantile function is the inverse CDF, while the expectile function in general is not). Interestingly, coupling expectile regression with our mapper (right) allows us to recover the quantile function much more efficiently than quantile regression itself. This confirms the findings from prior work~\citep{expectile-blue, exp-quant-david-goliath} and indicates that our learning-based procedure for mapping estimated expectiles to their corresponding quantile fractions is effective.

In Figure~\ref{fig:separate_vs_dual_bellman}, we instantiate the problem in a typical dynamic programming setting to illustrate the deficiencies of regular quantile and expectile dynamic programming. We observe (left) that quantile function learning is sample-inefficient and fails to approximate the distribution within the given evaluation budget.\footnote{In this figure, the quantile and expectile functions are parameterized by neural networks, as opposed to Figure~\ref{fig:separate_vs_dual_regression}, where each statistic is learned independently of the others. This explains why the quantile function appears smoother in this figure.} However, the distributional information is propagated correctly through temporal-difference updates, since the quantile functions estimated at each state coincide. In contrast, the expectile function collapses to the mean as the error propagates from $s_4$ to $s_1$. This is because expectile values at the next state-action pair cannot be used as pseudo-samples of the return distribution $Z(s_{t+1})$~\citep{er-dqn}. Finally, Figure~\ref{fig:separate_vs_dual_bellman} (right) shows that our dual training method, where the pseudo-samples of $Z(s',a')$ are the estimated quantiles $Z_\theta(s_{t+1}, m_\phi(\tau))$, solves both issues: the expectile function no longer collapses, and the quantile function approximation is an accurate estimate of the inverse CDF.



\subsection{Experiments on the Atari Arcade Learning Environment}

\subsubsection{Baselines}

We experimented with the following baselines to evaluate our approach:
\begin{description}%[leftmargin=*,nosep]
    \item[IQN-0, IQN-1] We approximate quantiles using the general approach described in IQN~\citep{iqn}, without (IQN-0) and with (IQN-1) a Huber loss.
    \item[IEN-Naive] We use an approach similar to IQN, but trained with an expectile loss and a naive imputation step as described by~\citet{er-dqn}, i.e., expectile values are used directly as targets for the temporal-difference loss. We do not use the solver-based implementation described by the authors, as it was approximately 25 times slower than the other baselines on our setup.
\end{description}
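
As a reminder, the losses that distinguish these baselines can be written in their standard textbook form; the notation below ($u$, $\ell$, $L_\kappa$) is introduced here for illustration and may differ from the notation used elsewhere in the paper. For a residual $u = y - \hat{z}$ between a target $y$ and a prediction $\hat{z}$ at fraction $\tau \in (0, 1)$,
\[
\ell^{\mathrm{Q}}_{\tau}(u) = \left| \tau - \mathbf{1}\{u < 0\} \right| \, |u|,
\qquad
\ell^{\mathrm{E}}_{\tau}(u) = \left| \tau - \mathbf{1}\{u < 0\} \right| \, u^2,
\]
\[
\ell^{\mathrm{H}}_{\tau, \kappa}(u) = \left| \tau - \mathbf{1}\{u < 0\} \right| \, \frac{L_\kappa(u)}{\kappa},
\qquad
L_\kappa(u) = \left\{
\begin{array}{ll}
\frac{1}{2} u^2 & \mathrm{if}\ |u| \le \kappa, \\
\kappa \left( |u| - \frac{1}{2}\kappa \right) & \mathrm{otherwise.}
\end{array}
\right.
\]
The quantile loss $\ell^{\mathrm{Q}}_{\tau}$ is an asymmetric absolute error, the expectile loss $\ell^{\mathrm{E}}_{\tau}$ is an asymmetric squared error (at $\tau = \frac{1}{2}$ its minimizer is the mean), and the Huber quantile loss $\ell^{\mathrm{H}}_{\tau, \kappa}$ is quadratic for residuals smaller than $\kappa$ and linear beyond.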

\subsubsection{Environments}

We conduct our experiments on the Arcade Learning Environment (ALE)~\citep{bellemare13arcade}, following the setup of~\citet{machado18arcade}, notably including a $25\%$ chance of a sticky action at each step, i.e., repeating the previous action instead of the action selected by the agent. This creates stochasticity in the environment, which distributional RL agents should capture. To accommodate limited computing resources, we restrict ourselves to the Atari-5 subbenchmark~\citep{atari5}, while using 5 seeds to reduce the uncertainty in our results. We run 25 validation episodes every $1$M steps to generate our performance curves.
As is common with ALE, we report human-normalized scores rather than raw game scores, and we aggregate them using the interquartile mean (IQM), as it \textit{is a better indicator of overall performance} (compared to the sample median)~\citep{agarwal2021deep}, due to its robustness to differences in scale across tasks and to outliers. This robustness is especially needed here, as the presence of sticky actions increases the number of outlier~seeds.
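
For concreteness, these two quantities can be written as follows, using the usual definitions; the symbols and the trimming convention are assumptions made for illustration rather than a description of our exact evaluation code. The human-normalized score of a run with raw score $x_{\mathrm{agent}}$ is
\[
\mathrm{HNS} = \frac{x_{\mathrm{agent}} - x_{\mathrm{random}}}{x_{\mathrm{human}} - x_{\mathrm{random}}},
\]
where $x_{\mathrm{random}}$ and $x_{\mathrm{human}}$ are the reference scores of a random policy and of a human player on the same game. Given $n$ normalized scores sorted as $x_{(1)} \le \dots \le x_{(n)}$, one common convention for the interquartile mean discards the lowest and highest $k = \lfloor n/4 \rfloor$ values and averages the rest:
\[
\mathrm{IQM} = \frac{1}{n - 2k} \sum_{i = k + 1}^{n - k} x_{(i)} .
\]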




\subsubsection{Implementation details}


We base all baselines and our method on the same underlying neural network, implemented in JAX~\citep{jax2018github}. Its architecture follows the structure detailed by~\citet{iqn}. We used the training loop composition of CleanRL~\citep{cleanrl}. Hyperparameters can be found in the appendix. 
We implemented the $Z$-function for all agents as a feed-forward neural network with layer normalization, which we found to increase performance for both our method and the baselines. We did not use the fraction proposal network introduced with FQF~\citep{fqf}, as our method can be seen as complementary to it, and we focus on the effect of the choice of statistics.

As described in Algorithm~\ref{alg-ieqn}, we only use the expectile loss to update the $Z$-function for our agent, while we use the quantile loss to update our mapper.
The mapper is implemented as a two-layer residual fully-connected neural network with ReLU and Tanh activations.
Since it is queried to obtain both the candidate and target values, we use a mapper-specific target network updated less frequently than the live network, using Polyak averaging~\citep{polyak} with a weight of $0.5$. We share the mapper's parameters across all states to simplify its architecture. We detail the implications of this choice in the appendix.
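
Written out, a Polyak update of the mapper's target parameters takes the following form, where $\bar{\phi}$ (target parameters) and $\alpha$ (averaging weight) are notation introduced here for illustration:
\[
\bar{\phi} \leftarrow (1 - \alpha)\, \bar{\phi} + \alpha\, \phi, \qquad \alpha = 0.5 .
\]
Conventions differ on whether the weight multiplies the live or the target parameters; with a weight of $0.5$ both conventions give the same update, namely the midpoint between the two parameter vectors.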



\subsubsection{Results}
\label{sec:atari}

In this section, we verify that our dual approach also provides benefits at scale, on a classic benchmark. 

We first present, in Figure~\ref{fig:iqm-atari5}, the aggregated results over 5 seeds on the Atari-5 benchmark. %A first observation is that IQN-1 yields a higher performance than IQN-0, hinting that the Huber loss (and thus hybrid $L_1$/$L_2$-based approaches) might be more efficient than a sole $L_1$ learning scheme.
We can see that, despite a slower start, IEQN ends up matching the performance of IQN-1. To support this comparison statistically, we also performed a bootstrap hypothesis test on the difference of IQMs at the end of training (we average scores from the last 5 validation epochs to be robust to instabilities). We found that our method surpasses the performance of both the quantile approach (achieved significance level $0.0117$) and the naive expectile approach (achieved significance level $0$), thereby demonstrating the benefits of dual regression over single regression of either quantiles or expectiles on the final performance.
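
As a reading aid, the achieved significance level (ASL) of a one-sided bootstrap test is usually estimated as below; this is the generic textbook form, and the symbols $B$, $\hat{\theta}$, and $\hat{\theta}^{*}_{b}$ are assumptions introduced for illustration, not a specification of our exact procedure. With $\hat{\theta}$ the observed difference of IQMs and $\hat{\theta}^{*}_{1}, \dots, \hat{\theta}^{*}_{B}$ the differences computed on $B$ bootstrap resamples drawn under the null hypothesis of no difference,
\[
\widehat{\mathrm{ASL}} = \frac{1}{B}\, \# \left\{ b : \hat{\theta}^{*}_{b} \ge \hat{\theta} \right\} .
\]
Under this estimator, a reported value of $0$ means that no bootstrap replicate reached the observed difference, and is best read as an ASL below the resolution $1/B$ of the test.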

\begin{figure}[t]
    \centering
    \includegraphics[clip, trim=5mm 0mm 15mm 0mm, width=\linewidth]{figs/iqm.pdf}
    \caption{Interquartile mean of the human-normalized score of distributional RL agents on the Atari-5 benchmark with 5 random seeds per environment. Shaded areas correspond to the 25th and 75th percentiles of a bootstrap distribution. A rolling average with a window size of $20$M frames is applied to enhance readability.}
    \label{fig:iqm-atari5}
\end{figure}

Furthermore, we verify in Table~\ref{tab:distribution-spread} that IEQN avoids distributional collapse in practice. Indeed, IQN-1's estimated distribution is much narrower than IQN-0's, which confirms that the Huber loss causes distributional collapse despite its better efficiency, whereas IEQN's quantile spread is much larger than IQN-1's. Moreover, the expectile spread of IEQN is much larger and more stable than that of IEN-Naive, suggesting that naive expectile distributional RL yields degenerate distributions, as noted by~\citet{er-dqn}, but that dual expectile-quantile distributional RL avoids this collapse.

\begin{table}[t]
    \caption{Average and standard deviation of the distance between the $0.1$ and $0.9$ quantiles (respectively expectiles), relative to the scale of the Q-function, at the end of training.}
    \label{tab:distribution-spread}
    \centering
    \begin{tabular}{l c c}
    \toprule
       & Quantile spread & Expectile spread \\
       \midrule
       IQN-0 & 1.25 $\pm$ 0.198 & -- \\
       IQN-1 & 0.144 $\pm$ 0.072 & -- \\
       IEN-Naive & -- & 0.174 $\pm$ 0.195 \\
       IEQN & 0.721 $\pm$ 0.142 & 0.465 $\pm$ 0.086 \\
    \bottomrule
    \end{tabular}
\end{table} 
