\section{Related Work}
\label{sec:related-work}

\subsection{Distributional reinforcement learning}
\label{sec:drl-related}

Distributional reinforcement learning has been shown to offer several benefits over a mean-based approach: by ascribing randomness to the value of a state-action pair, an algorithm can learn more efficiently for close states and actions~\citep{mavrin2019distributional} and capture possible stochasticity in the environment~\citep{sddrl}. Some works also use distributional RL for risk-sensitive control~\citep{pmlr-v139-fei21a, NEURIPS2022_c88a2bd0, NEURIPS2022_d2511dfb}. Multiple families of approaches have emerged.

Estimating a parameterized distribution is a straightforward approach, and has been explored from both Bayesian~\citep{strens2000bayesian, vlassis2012bayesian} and frequentist~\citep{gtdqn} perspectives. However, this usually requires an expensive likelihood computation and imposes a restrictive assumption on the shape of the return distribution $Z$. For instance, assuming a normal distribution when the actual distribution is heavy-tailed can yield disappointing results.

Thus, approaches based on non-parametric estimation are also used to approximate the distribution. C51~\citep{c51} quantizes the domain where $Z$ has non-zero density (usually into 51 atoms, hence the name), and performs weighted classification on the atoms by computing the cross-entropy between $Z$ and $\mathcal{T}^\pi Z$ projected onto those atoms. While C51 improves performance over non-distributional RL, it requires the user to set the return bounds manually and is not guaranteed to minimize any $p$-Wasserstein metric with respect to the target return distribution.
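
As a brief illustration, using notation that is ours rather than taken from the cited work, let $V_{\min}$ and $V_{\max}$ denote the user-chosen return bounds, $N$ the number of atoms, and $(s, a)$ a state-action pair. The fixed support and the classification loss can then be sketched as
\[
z_i = V_{\min} + i\,\Delta z, \qquad \Delta z = \frac{V_{\max} - V_{\min}}{N - 1}, \qquad i = 0, \dots, N - 1,
\]
\[
\mathcal{L}(s, a) = -\sum_{i=0}^{N-1} m_i \log p_i(s, a),
\]
where $p_i(s, a)$ is the predicted probability of atom $z_i$ and $m_i$ is the probability mass that the target $\mathcal{T}^\pi Z$ places on $z_i$ after projection onto the support. Returns outside $[V_{\min}, V_{\max}]$ are clipped by this projection, which is why the bounds must be chosen with care.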

Another important non-parametric approach to the estimation of a distribution is quantile regression, which relies on the minimization of an asymmetric $L_1$ loss. Estimating quantiles allows one to approximate the action-value distribution without relying on a shape assumption. QR-DQN~\citep{qr-dqn} introduced quantile regression as a way to minimize the $1$-Wasserstein metric between $Z$ and $\mathcal{T}^\pi Z$. ER-DQN~\citep{er-dqn} replaced the estimation of quantiles with that of expectiles, at the cost of a potential distribution collapse, which they prevent via a root-finding procedure.
Further, implicit quantile networks (IQN)~\citep{iqn} sample and embed quantile fractions instead of keeping them fixed, thereby improving performance. Fully parameterized quantile functions (FQF)~\citep{fqf} add another network that generates the quantile fractions to be estimated.
We build on IQN and its expectile counterpart to propose a well-performing, non-collapsing agent.
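
For reference, the asymmetric $L_1$ loss underlying these quantile-based methods can be written as follows; the symbols $\tau$, $\rho_\tau$ and $q_\tau$ are our notation for this sketch. For a quantile fraction $\tau \in (0, 1)$,
\[
\rho_\tau(u) = \left|\tau - \mathbf{1}\{u < 0\}\right| \, |u|, \qquad q_\tau \in \arg\min_{q} \mathrm{E}\left[\rho_\tau(Z - q)\right],
\]
so that under-estimation errors are weighted by $\tau$ and over-estimation errors by $1 - \tau$. For $\tau = 1/2$ the loss reduces to half the absolute error, and the minimizer is a median of $Z$.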

\subsection{Expectile regression}
\label{sec:expectile-regression}
Expectiles were originally introduced as a family of estimators of \textit{location parameters} for a given distribution, to mitigate possible heteroskedasticity of the error terms in regression~\citep{expectilesoriginal, expectile-blue}.

Expectiles can be seen as mean estimators under missing data~\citep{expbible}. Unlike quantiles, they span the entire convex hull of the distribution's support, and on this set the expectile function is strictly increasing: an expectile fraction is always associated with a unique value. Expectiles have already been used successfully in reinforcement learning~\citep{er-dqn}, but in a way that requires a slow optimization step to achieve satisfactory performance. Moreover, expectile regression is subject to the same crossing issue as quantile regression, albeit empirically less so~\citep{exp-quant-david-goliath}.
Expectiles have also been used in offline reinforcement learning to compute a soft maximum over potential outcomes seen in the offline data~\citep{kostrikov2022offline}.
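
To make the definition concrete, a standard way to write the $\tau$-expectile of $Z$ is given below; the symbols $\tau$ and $e_\tau$ are our notation for this sketch, and we assume that $Z$ has finite variance. For $\tau \in (0, 1)$,
\[
e_\tau = \arg\min_{e} \mathrm{E}\left[\left|\tau - \mathbf{1}\{Z < e\}\right| (Z - e)^2\right],
\]
that is, the quantile regression loss with the absolute error replaced by a squared error. The corresponding first-order condition reads
\[
\tau\, \mathrm{E}\left[(Z - e_\tau)_+\right] = (1 - \tau)\, \mathrm{E}\left[(e_\tau - Z)_+\right],
\]
where $(u)_+ = \max(u, 0)$. Setting $\tau = 1/2$ gives $e_{1/2} = \mathrm{E}[Z]$, so the mean plays the role for expectiles that the median plays for quantiles.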

Importantly for our work, it has been shown that under mild assumptions expectile regression is the best linear unbiased estimator of any location parameter within the range of the distribution, which includes any quantile of the distribution~\citep{expectile-blue}. In particular, expectile regression has lower variance than quantile regression for estimating quantiles of the distribution. This theoretical property has been confirmed empirically by~\citet{exp-quant-david-goliath}. These observations encourage us to use expectile regression as a way to estimate quantiles of the value distribution, which we describe in the next section. In contrast to prior works that proposed numerical solutions to the problem of mapping an estimated expectile to its corresponding quantile~\citep{er-dqn, exp-quant-david-goliath}, we propose a learning-based approach to this problem.
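
The mapping in question can be stated in one line, under the assumption (ours, for illustration) that $Z$ has a continuous, strictly increasing cumulative distribution function $F_Z$. Writing $e_\tau$ for the $\tau$-expectile and $q_{\tau'}$ for the $\tau'$-quantile,
\[
e_\tau = q_{\tau'} \quad \mathrm{with} \quad \tau' = F_Z(e_\tau),
\]
so every expectile is also a quantile, but at a fraction $\tau'$ that depends on the unknown distribution. For example, for a symmetric distribution with finite mean, $e_{1/2} = q_{1/2}$, whereas for a skewed distribution the mean $e_{1/2}$ generally differs from the median and hence $\tau' \neq 1/2$.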

