\section{Introduction}
\label{sec:intro}

Homoskedastic regression models assume output noise of constant variance (e.g., Gaussian noise) and amount to learning a function $f(x)$ that tries to predict the most likely target $y$ for input $x$. In contrast, \emph{heteroskedastic} models assume that the output noise may also depend on the input features $x$, and try to learn a conditional distribution $p(y \mid x)$ with non-uniform variance. The promise of this approach is to weight training data points differently and to train models that ``know where they fail'' \citep{skafte_reliable_2019, fortuin_deep_2022}.

Unfortunately, overparameterized heteroskedastic regression models (e.g., based on deep neural networks) are prone to extreme forms of overfitting \citep{lakshminarayanan_simple_2017, nix_estimating_1994}. On the one hand, the mean model is flexible enough to fit every training datum's target perfectly, while the standard deviation network learns to maximize the likelihood by shrinking the predicted standard deviations to zero. On the other hand, just the tiniest amount of regularization on the mean network will make the model prefer a constant solution. Such a flat prediction results from the standard deviation network's ability to explain all residuals as random noise, thus overfitting the data's empirical prediction residuals. \cref{fig:cartoonphases} shows both types of overfitting. 
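
To make both failure modes concrete, consider a minimal sketch that assumes a Gaussian likelihood with a mean network $\mu(x)$ and a standard deviation network $\sigma(x)$; these symbols are introduced here for illustration only and need not match the notation used later. Up to an additive constant, the negative log-likelihood of a training set $\{(x_i, y_i)\}_{i=1}^N$ would read
\[
\mathcal{L}(\mu, \sigma) = \sum_{i=1}^{N} \left[ \frac{\bigl(y_i - \mu(x_i)\bigr)^2}{2\,\sigma(x_i)^2} + \log \sigma(x_i) \right].
\]
If the mean network interpolates the data, $\mu(x_i) = y_i$, the quadratic term vanishes and each summand reduces to $\log \sigma(x_i)$, which is unbounded below as $\sigma(x_i) \to 0$. Conversely, for a fixed mean, each summand is minimized at $\sigma(x_i)^2 = \bigl(y_i - \mu(x_i)\bigr)^2$, so a sufficiently flexible standard deviation network can absorb any residual as noise, even when the mean prediction is flat.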

While several practical solutions to learning overparameterized heteroskedastic regression models have been proposed \citep{skafte_reliable_2019, stirn_variational_2020, seitzer_pitfalls_2022, stirn_faithful_2023, immer_effective_2023}, no comprehensive theoretical study of the failure of these methods has been offered so far. We conjecture this is because overparameterized models have attracted widespread attention only in the past few years, while classical statistics has mostly focused on under-parameterized (e.g., linear) regression models where such problems cannot occur \citep{huber_behavior_1967, astivia_heteroskedasticity_2019}.

This paper provides a theoretical analysis of the failure of heteroskedastic regression models in the overparameterized limit. To this end, it borrows a tool that abstracts away from any details of the involved neural network architectures: classical field theory from statistical mechanics \citep{landau_statistical_2013,altland_condensed_2010}. Via our field-theoretical description, we can recover the optimized heteroskedastic regressors as solutions to partial differential equations that can be derived from a variational principle. These equations can in turn be solved numerically by optimizing the field theory's free energy functional.

Our analysis results in a two-dimensional \emph{phase diagram}, representing the coarse-grained behavior of heteroskedastic noise models for every parameter configuration. Each of the two dimensions corresponds to the regularization strength of either the mean or the standard deviation network. As encountered in many complex physical systems, the field theory unveils \emph{phase transitions}, i.e., sudden and discontinuous changes in certain \emph{observables} (metrics of interest) that characterize the model, such as the smoothness of its mean prediction network, upon small changes in the regularization strengths. In particular, at weak regularization, we find a sharp phase boundary between the two types of overfitting behavior outlined above.

Our contributions are as follows:
\begin{itemize}[wide, labelwidth=!, labelindent=0pt]
\item We provide a unified theoretical description of overparameterized heteroskedastic regression models, which generalizes across different models and architectures, drawing on tools from statistical mechanics and variational calculus.
\item In this framework, we derive a field theory (FT), which can explain the observed types of overfitting in these models and describe \emph{phase transitions} between them. We show qualitative agreement of our FT with experiments, both on simulated and real-world regression tasks.
\item As a practical consequence of our analysis, we reduce the hyperparameter search space from two dimensions to one by eliminating one regularization parameter, which empirically results in well-calibrated models. We demonstrate the benefits of our approach on a large-scale climate modeling example.
\end{itemize}
