\section{Related Work}


Uncertainty can be divided into epistemic (model) and aleatoric (data) uncertainty \citep{hullermeier_aleatoric_2021}, the latter of which can be further divided into homoskedastic (constant over input space) and heteroskedastic (varies over input space).
Handling heteroskedastic noise has long been, and remains, an active area of research in statistics \citep{huber_behavior_1967, eubank_detecting_1993, le_heteroscedastic_2005, uyanto_monte_2022} and machine learning \citep{abdar_review_2021}, but it is less common in deep learning \citep{kendall_what_2017, fortuin_deep_2022}, probably due to the pathologies that we analyze in this work.
Heteroskedastic noise modeling can be interpreted as reweighting the importance of datapoints during training, which \citet{wang_robust_2017} and \citet{mandt_variational_2016} show to be beneficial in the presence of corrupted data and \citet{khosla_neural_2022} show to be beneficial in active learning.
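
The reweighting interpretation can be made concrete with a minimal sketch under assumed notation; the symbols $\mu_\theta$, $\sigma_\theta$, and $\mathcal{L}$ are introduced here purely for illustration and are not taken from the cited works.
For a Gaussian likelihood with input-dependent mean $\mu_\theta(x)$ and standard deviation $\sigma_\theta(x)$, the negative log-likelihood of a single datapoint $(x, y)$ is, up to an additive constant,
\[
    \mathcal{L}(x, y; \theta) = \frac{\left(y - \mu_\theta(x)\right)^2}{2\,\sigma_\theta(x)^2} + \log \sigma_\theta(x),
\]
and its derivative with respect to the predicted mean is
\[
    \frac{\partial \mathcal{L}}{\partial \mu_\theta(x)} = -\frac{y - \mu_\theta(x)}{\sigma_\theta(x)^2}.
\]
Each residual is therefore scaled by the inverse predicted variance, so datapoints to which the model assigns large noise contribute less to the update of the mean than they would under a homoskedastic model.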

To the best of our knowledge, \citet{nix_estimating_1994} were the first to model a mean and a standard deviation function with neural networks under a Gaussian likelihood.
\citet{skafte_reliable_2019} suggest four changes: modifying the optimization loop to train the mean and standard deviation networks separately; treating the standard deviations variationally and integrating them out, as \citet{takahashi_student-t_2018} do in the context of VAEs; accounting for the location of the data when sampling; and setting a predefined global variance when extrapolating.
\citet{stirn_variational_2020} also perform amortized variational inference on the standard deviations and evaluate their model from the perspective of posterior predictive checks.
\citet{seitzer_pitfalls_2022} provide an in-depth analysis of the shortcomings of maximum likelihood estimation in this setting and adjust the gradients during training to avoid pathological behavior.
\citet{stirn_faithful_2023} extend the idea of splitting the training of the mean and standard deviation networks to a setting in which several shared layers learn a representation before the mean and standard deviation are emitted.
Finally, \citet{immer_effective_2023} take a Bayesian approach to the problem and use a Laplace approximation to the marginal likelihood to perform empirical Bayes. This allows regularization to be applied through the prior and model uncertainty to be separated from data uncertainty.
While these works propose practical solutions, in contrast to our work, none of them study the theoretical underpinnings of these pathologies, let alone in a model- or data-agnostic way.
