\newpage
\appendix 

\clearpage

\section*{Appendix}
This appendix has the following sections:
\begin{enumerate}[leftmargin=*]
    \item[\ref{appendix:hyperparameters-etc}] Hyperparameters, code and implementation details
    \item[\ref{appendix-mapper}] Sharing the mapper's parameters
    \item[\ref{toy-MDP}] Toy Markov decision process
    \item[\ref{app:proofs}] Proof of Lemma~\ref{absolutecontinuity}
    \item[\ref{appendix:proof-of-theorem-2}] Proof of Theorem~\ref{lemma}
    \item[\ref{appendix:proof-of-corollary-3}] Proof of Corollary~\ref{corollary}
    \item[\ref{appendix:proof-of-corollary-3}] Analysis of the estimated variance
    
\end{enumerate}

\section{Hyperparameters, code and implementation details}
\label{appendix:hyperparameters-etc}
\subsection{Hyperparameters}
\label{hyperparams}


We use JAX~\citep{jax2018github} to train our models.
A full training procedure of $200$M training frames and corresponding validation epochs takes approximately $50$ hours in our setup.    

\begin{table}[!ht]
    \caption{$Z$-function hyperparameters.}
    \label{tab:shared-hyperparams}
    \centering
    \begin{tabular}{l c}
        \toprule
        Key & Value \\
        \midrule
         Discount factor & $0.99$ \\
         Batch size & $32$  \\
         Fraction distribution & $\mathcal{U}([0,1])$ \\
         Learning rate & $1\mathrm{e}{-4}$ \\
         Random frames before training & $200000$ \\
         Size of convolutional layers & $[32,64,64]$ \\
         Size of fully-connected layer & $512$ \\
         Critic updates per sample & $2$ \\
         Buffer size & $1\mathrm{e}6$ \\
         Frames between target network updates & $35000$ \\
         Target network update rate & $1.0$ \\
         \bottomrule
    \end{tabular}
\end{table}

\begin{table}[!ht]
    \caption{Mapper hyperparameters.}
    \label{tab:mapper-hyperparams}
    \centering
    \begin{tabular}{lc}
        \toprule
        Key & Value \\
        \midrule
         Layer size & $64$ \\
         Learning rate & $7\mathrm{e}{-5}$ \\
         Target network update rate & $0.5$  \\
         \bottomrule
    \end{tabular}
\end{table}
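
As a hedged reading of the ``target network update rate'' entries above, assume that they denote a Polyak averaging coefficient, written $\rho$ here (a symbol introduced only for this illustration). The target parameters $\bar{\theta}$ would then be updated from the online parameters $\theta$ as
$$\bar{\theta} \leftarrow \rho \, \theta + (1 - \rho) \, \bar{\theta}.$$
Under this assumption, $\rho = 1.0$ for the $Z$-function corresponds to a hard copy $\bar{\theta} \leftarrow \theta$ every $35000$ frames, whereas $\rho = 0.5$ for the mapper averages the online and target parameters in equal proportion at each target update.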

\subsection{Code}

Our training and evaluation loop is based on CleanRL~\citep{cleanrl}. The code base is available at \url{https://github.com/samijullien/ieqn}.

\section{Sharing the mapper's parameters}
\label{appendix-mapper}

Sharing the mapper's parameters across states and actions reduces the computational burden, which is one of the goals of this paper. We found this technique to work well in practice on the Atari-5 benchmark, although it requires additional assumptions in theory. We review
these assumptions in this section.

\citet{expectile-location-scale} show that such a shared mapping between quantiles and expectiles exists when the regression follows a location-scale model, i.e., for random variables $X$ and $Y$:
$$Y = \mu(X) + \sigma(X) \varepsilon,$$
where $\mu$ and $\sigma$ are continuous functions, $\varepsilon$ is centered with finite variance, and $\varepsilon$ and $X$ are independent. When the return distribution follows this model, with $X$ being the state-action variable in this context, sharing the mapper's parameters is theoretically valid. While this may seem limiting, it does not require all state-action pairs to have the same return distribution, only that their distributions share a common shape. Moreover, the location-scale family is quite broad, as it includes, e.g., normal, Student, Cauchy, and GEV distributions, and more~\citep{WeiEA2014}.
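
To illustrate why the location-scale model permits a shared mapper, the following sketch assumes $\sigma(x) > 0$ and writes $q_\tau(\cdot)$ and $e_\tau(\cdot)$ for the $\tau$-quantile and the $\tau$-expectile (notation introduced only for this example). Both statistics are equivariant under positive affine transformations, so that, for every $x$,
\begin{align*}
    q_\tau(Y \mid X = x) &= \mu(x) + \sigma(x) \, q_\tau(\varepsilon), \\
    e_{\tau'}(Y \mid X = x) &= \mu(x) + \sigma(x) \, e_{\tau'}(\varepsilon).
\end{align*}
Hence $q_\tau(Y \mid X = x) = e_{\tau'}(Y \mid X = x)$ if and only if $q_\tau(\varepsilon) = e_{\tau'}(\varepsilon)$. This condition does not involve $x$, so the correspondence $\tau \mapsto \tau'$ between quantile and expectile levels is the same for all values of $X$.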

In many distributional reinforcement learning scenarios, this assumption may be satisfied. For instance, when the environment's stochasticity arises from small, independent perturbations, e.g., normally distributed errors, the return distribution at every state remains normally distributed, since the convolution of Gaussian distributions is also Gaussian. On the other hand, this assumption can fail under high-frequency transition distributions, i.e., branching behaviors, where the same state-action pair can yield drastically different outcomes and the reward-next-state distribution has non-continuous support. We leave for future work the investigation of when sharing the mapper's parameters across state-action pairs fails in practice.
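
As a simple worked instance of the Gaussian case, assume (for illustration only) deterministic transitions over a finite horizon $T$ and independent rewards $R_t \sim \mathcal{N}(m_t, s_t^2)$, where $m_t$, $s_t$ and $T$ are hypothetical quantities introduced for this example. The discounted return then satisfies
$$\sum_{t=0}^{T-1} \gamma^t R_t \sim \mathcal{N}\left( \sum_{t=0}^{T-1} \gamma^t m_t, \; \sum_{t=0}^{T-1} \gamma^{2t} s_t^2 \right),$$
which is of the location-scale form above with $\varepsilon$ a standard normal variable.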


\section{Toy Markov decision process}
\label{toy-MDP}

\begin{figure}[!h]
    \includegraphics[width=.9\columnwidth]{figs/MDP-diagram.pdf}
    \caption{Toy Markov decision process.}
    \label{fig:toy-mdp}
\end{figure}


\begin{figure*}[!ht]

\section{Proof of Lemma~\ref{absolutecontinuity}}
\label{app:proofs}

    Our proof of Theorem~\ref{lemma} requires the absolute continuity of the expectile function. Therefore, we first prove the following lemma:

\absolutecontinuity*

\begin{proof}
Our proof relies on the Banach-Zarecki theorem~\citep{zaretsky}, which states that any real-valued function $f$ defined on a real bounded closed interval is absolutely continuous if and only if on this interval:
\begin{enumerate}[label=(\roman*)]
    \item $f$ is continuous;
    \item $f$ has bounded variation; and
    \item $f$ satisfies the Luzin N property~\citep{luzin}, i.e., the image by $f$ of a set with null Lebesgue measure also has null Lebesgue measure.
\end{enumerate}

It is well known that the expectile function is continuous on $[0,1]$~\citep{german-paper, expbible}. Therefore, (i) is satisfied.

$E_Z$ is monotonically increasing and takes values in the bounded support of $Z$. Therefore, it has bounded variation and (ii) is satisfied.
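
Concretely, writing $[a,b]$ for an interval containing the support of $Z$, monotonicity makes the variation sum telescope for any partition $0 = \tau_0 < \tau_1 < \dots < \tau_n = 1$:
$$\sum_{j=1}^{n} \left| E_Z(\tau_j) - E_Z(\tau_{j-1}) \right| = E_Z(1) - E_Z(0) \leqslant b - a,$$
so the total variation of $E_Z$ on $[0,1]$ is at most $b - a$.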


In order to prove (iii), we first note that any function that is differentiable on a co-countable set has the Luzin N property~\citep{luzin}. We therefore use our assumption that the CDF of $Z$ admits a finite number of discontinuities in the following.

Let $F_Z$ be the CDF of $Z$ and $D = \left\{ z \in [a,b] : \lim_{x \to z} F_Z(x) \neq F_Z(z) \right\}$ be the finite set of points at which $F_Z$ is not continuous. $D$ is a finite set within a metric space and therefore closed. As a consequence, its complement $C_{[a,b]} = [a,b] \setminus D$ is open in $[a,b]$, i.e., $\forall z \in C_{[a,b]}, \exists \varepsilon > 0$ such that $\forall x \in [a,b],\ d(x,z) < \varepsilon \Rightarrow x\in C_{[a,b]}$. In other words, if $F_Z$ is continuous at a point within $[a,b]$, it is also continuous in a neighborhood of that point within $[a,b]$. By assumption, the set $C^\mathcal{N}_{[a,b]} = \left\{ z \in [a,b] : \exists \varepsilon > 0, \forall x \in [a,b],\ d(x,z) < \varepsilon \Rightarrow x\in C_{[a,b]}\right\}$ of points where $F_Z$ is continuous in a neighborhood of said point is therefore co-finite.


It has been shown that the expectile function $E_Z$ is continuously differentiable at any point $\tau \in [0,1]$ such that $F_Z$ is continuous in a neighborhood of $E_Z(\tau)$~\citep{german-paper, expectilesoriginal}. The expectile function is bijective~\citep{expbible} so the set of points where $E_Z$ is differentiable $\mathcal{D}_{[a,b]}^{E_Z} = E_Z^{-1}\left(C^\mathcal{N}_{[a,b]} \right)$ is also a co-finite set.


The expectile function is differentiable on a co-finite (and thus co-countable) set and therefore has the Luzin N property~\citep{luzin}, which yields (iii).

We can finally apply the Banach-Zarecki theorem and conclude that the expectile function $E_Z$ is absolutely continuous on $[0,1]$.
\end{proof}
\end{figure*}

\begin{figure*}[t]
\section{Proof of Theorem~\ref{lemma}}
\label{appendix:proof-of-theorem-2}
We can now use the absolute continuity of the expectile function under our assumptions to prove the following theorem:

\wassersteinbound*

\begin{proof}
Thanks to the triangle inequality, we have:
\begin{equation} 
\begin{split}
        W_1(\Pi_{\mathcal{M}}^K\eta, \eta) \leqslant W_1(\Pi_{\mathcal{M}}^K\eta, \Pi_Q^K\eta) + W_1(\Pi_Q^K\eta, \eta) \;, 
\end{split}
\end{equation}
where $\Pi_Q^K$ is the projected quantile regression estimator defined as:
$$
\forall \eta \in \mathscr{P}(\mathbb{R}), \;\; \Pi_Q^K(\eta) = \frac{1}{K} \sum_{k=1}^{K} \delta_{F^{-1}_\eta(\tau_k)} \;.
$$
\citet[Lemma 3.2]{er-dqn} showed that $W_1(\Pi_Q^K\eta, \eta) = \mathcal{O}\left( \frac{1}{K} \right)$. We now turn to the first term:
%
\begin{align}  
        W_1(\Pi_{\mathcal{M}}^K\eta, \Pi_Q^K\eta) &= \sum_{i=0}^{K-1} \frac{1}{K} \left| E_\eta \left(\mathrm{floor}^K \left( E^{-1}_\eta \left( F^{-1}_\eta \left(\frac{2i + 1}{2K} \right) \right)  \right)\right) - F^{-1}_{\eta} \left( \frac{2i + 1}{2K}\right) \right| 
        \\
        & = \sum_{i=0}^{K-1} \frac{1}{K} \left| E_\eta \left(\mathrm{floor}^K \left( E^{-1}_\eta \left( F^{-1}_\eta \left(\frac{2i + 1}{2K} \right) \right)  \right)\right) - 
        %{}\right. 
        %\\
        %& \hspace*{4cm}
        %\left. 
        E_\eta \left( E^{-1}_\eta \left(F^{-1}_{\eta} \left( \frac{2i + 1}{2K}\right)\right)\right) \right| 
        \\
        & \leqslant \sum_{i=0}^{K-1} \frac{1}{K} \left| E_\eta \left(\mathrm{floor}^K \left( E^{-1}_\eta \left( F^{-1}_\eta \left(\frac{2i + 1}{2K} \right) \right)  \right)\right) -
        %{}\right.
        %\\ 
        %& \hspace*{4cm}\left. 
        E_\eta \left( \mathrm{floor}^K \left( E^{-1}_\eta \left( F^{-1}_\eta \left(\frac{2i + 1}{2K} \right) \right) \right) + \frac{1}{K}\right) \right|,
\end{align}
%
where the last inequality is obtained thanks to the monotonicity of the expectile function. By absolute continuity of the expectile function under our assumptions (proven in Lemma~\ref{absolutecontinuity}), we have:
%
\begin{align}  
    & \lim_{K \to \infty} \left| E_\eta \left(\mathrm{floor}^K \left( E^{-1}_\eta \left( F^{-1}_\eta \left(\frac{2i + 1}{2K} \right) \right)  \right)\right) - 
    %{}\right.
    %\\
    %& \hspace*{4cm} \left. 
    E_\eta \left( \mathrm{floor}^K \left( E^{-1}_\eta \left( F^{-1}_\eta \left(\frac{2i + 1}{2K} \right) \right) \right) + \frac{1}{K}\right) \right| = 0,
\end{align}
%
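from which, since the absolute continuity of $E_\eta$ on $[0,1]$ implies its uniform continuity and the convergence is therefore uniform in $i$, we can deduce $\lim_{K \to \infty} W_1(\Pi_{\mathcal{M}}^K\eta, \Pi_Q^K\eta)= 0$ and finally $\lim_{K \to \infty} W_1(\Pi_{\mathcal{M}}^K\eta, \eta) = 0$.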
\end{proof}
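
The monotonicity step in the proof above can be unpacked as follows, assuming that $\mathrm{floor}^K$ rounds down to the nearest multiple of $\frac{1}{K}$, i.e., $\mathrm{floor}^K(\tau) = \lfloor K \tau \rfloor / K$ (this explicit form is an assumption made for this illustration). Writing $t_i = E^{-1}_\eta \left( F^{-1}_\eta \left( \frac{2i+1}{2K} \right) \right)$, we have $\mathrm{floor}^K(t_i) \leqslant t_i \leqslant \mathrm{floor}^K(t_i) + \frac{1}{K}$, and since $E_\eta$ is increasing,
$$0 \leqslant E_\eta(t_i) - E_\eta\left(\mathrm{floor}^K(t_i)\right) \leqslant E_\eta\left(\mathrm{floor}^K(t_i) + \frac{1}{K}\right) - E_\eta\left(\mathrm{floor}^K(t_i)\right),$$
which, summed over $i$ with weights $\frac{1}{K}$, gives the inequality used in the proof.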
\end{figure*}


\begin{figure*}[!ht]
\section{Proof of Corollary~\ref{corollary}}
\label{appendix:proof-of-corollary-3}

Finally, we can derive our main result for the use of distributional dynamic programming with both quantiles and expectiles:

\convergence*

\begin{proof}
    We have $\mathcal{T}_{\mathcal{M}^K}^\pi = \Pi_\mathcal{M}^K \mathcal{T}^\pi$.
    \citet[Proposition~5.7]{distributional-book} have shown that the set of empirical distributions $\mathcal{F}_E$ is closed under the operator $\mathcal{T}^\pi$. Thus, for any empirical return distribution $\eta \in \mathcal{F}_E$, $\mathcal{T}^\pi \eta$ is also empirical and its CDF admits finitely many discontinuities. Moreover, it has bounded support. Indeed, if, without loss of generality, we consider that the reward distribution takes values in $[0, R_\mathrm{max}]$, then every possible return distribution $\eta$ takes values in $[0, \frac{R_\mathrm{max}}{1-\gamma} ]$, and therefore $\mathcal{T}^\pi \eta$ takes values in $[0, R_\mathrm{max} + \gamma \frac{R_\mathrm{max}}{1-\gamma} ] = [0, \frac{R_\mathrm{max}}{1-\gamma} ]$.
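
    The equality between the two upper bounds is a one-line computation:
    $$R_\mathrm{max} + \gamma \frac{R_\mathrm{max}}{1-\gamma} = \frac{(1-\gamma) R_\mathrm{max} + \gamma R_\mathrm{max}}{1-\gamma} = \frac{R_\mathrm{max}}{1-\gamma}.$$
    For instance, with the discount factor $\gamma = 0.99$ of Table~\ref{tab:shared-hyperparams}, returns take values in $[0, 100 \, R_\mathrm{max}]$.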

    We can now apply Theorem~\ref{lemma}:
    $$ \forall \eta \in \mathcal{F}_E \;,  \lim_{K \to \infty} 
W_1(\Pi_{\mathcal{M}}^K\mathcal{T}^\pi \eta, \mathcal{T}^\pi \eta) = 0, $$ 
and the result immediately follows. 
\end{proof}
\end{figure*}

\begin{figure*}[!ht]
\section{Analysis of the estimated variance}

In this section, we perform an additional experiment to better assess the quality of the value distribution on the Atari task. The distribution learned by our method, as well as by all baselines, estimates the optimal $Z$-function, i.e., the return distribution of the optimal policy, for which no ground truth is available on large-scale tasks. We may, however, assume that the greedy policy gets closer to the optimal policy towards the end of training. Under this assumption, we can compare the learned $Z$-function with the return distribution obtained by rolling out our agent's policy. Below, we show the variance of the learned $Z$-function (Figure~\ref{fig:pred_variance}) and the average deviation between this prediction and the observed squared differences when rolling out the policy (Figure~\ref{fig:mae_variance}), throughout the first 50M steps of training on Battlezone.
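
One possible way to formalize these two quantities, stated as an assumption rather than as the exact procedure used for the figures, is the following. Let $\theta_1(s,a), \dots, \theta_K(s,a)$ denote $K$ samples of the learned $Z$-function at a state-action pair $(s,a)$, with mean $\bar{\theta}(s,a)$; all symbols here are introduced only for this sketch. The predicted variance would be
$$\hat{\sigma}^2(s,a) = \frac{1}{K} \sum_{k=1}^{K} \left( \theta_k(s,a) - \bar{\theta}(s,a) \right)^2,$$
and, given the discounted returns $G_t$ observed from the pairs $(s_t, a_t)$ visited during $T$ rollout steps, the error on the variance would be
$$\frac{1}{T} \sum_{t=1}^{T} \left| \hat{\sigma}^2(s_t, a_t) - \left( G_t - \bar{\theta}(s_t, a_t) \right)^2 \right|.$$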


\begin{subfigure}[b]{\columnwidth}
    \includegraphics[width=\columnwidth]{figs/predicted_variance.png}
    \caption{Predicted variance on Battlezone during training.}
    \label{fig:pred_variance}
\end{subfigure}%
\begin{subfigure}[b]{\columnwidth}
    \includegraphics[width=\columnwidth]{figs/mae_variance.png}
    \caption{Error on variance on Battlezone during training.}
    \label{fig:mae_variance}
\end{subfigure}%
\caption{Comparison of estimated variance against the variance observed when rolling out the greedy policy.}
\label{fig:variance_analysis}

\bigskip

We can see that (i) IQN-1 predicts a very low variance compared to IEQN, and (ii) under the approximation that the current policy is close to the optimal policy, IEQN's predicted variance gets closer to the observed variance than IQN-1's as training progresses.
\end{figure*}