\section{Pitfalls of Overparameterized Heteroskedastic Regression}

\paragraph{Heteroskedastic Regression}
Consider the setting in which we have a collection of independent data points $\mathcal{D}:=\{(x_i, y_i)\}_{i=1}^N$ with covariates $x_i \in \mathcal{X} \subset \R^d$ drawn from some distribution $x_i \sim p(x)$ and response values $y_i \in \mathcal{Y} \equiv \R$ normally distributed, each with its own mean $\mu_i$ and precision (inverse variance) $\Lambda_i>0$ (i.e., $y_i \sim \norm(\mu_i, \Lambda_i)$). We assume a \emph{heteroskedastic} setting, in which $\Lambda_i$ need not equal $\Lambda_j$ for $i\neq j$. Finally, we assume \emph{both} the mean and precision of $y_i$ to be explainable via $x_i$:
\begin{align}
y_i\sep x_i \sim \norm(\mu(x_i), \Lambda(x_i)) \text{ for } i = 1,\dots, N
\end{align}
with continuous functions $\mu:\mathcal{X}\rightarrow\R$ and $\Lambda:\mathcal{X}\rightarrow\R_{>0}$. In a modeling setting, learning $\Lambda$ can be seen as directly estimating and quantifying the \emph{aleatoric} (data)  uncertainty.

\paragraph{Overparameterized Neural Networks}
There exist many options for modeling $\mu$ and $\Lambda$. Of particular interest to many is representing each of these functions as neural networks \citep{nix_estimating_1994}---specifically ones that are overparameterized. These models are well-known \emph{universal function approximators}, which makes them great choices for estimating the true functions $\mu$ and $\Lambda$ \citep{hornik_approximation_1991}. 

Let the mean network $\nnmu:\mathcal{X}\rightarrow\R$ and precision network $\nnLambda:\mathcal{X}\rightarrow\R_{>0}$ be arbitrary depth, overparameterized feed-forward neural networks parameterized by $\theta$ and $\phi$ respectively. For a given input $x_i$, these networks collectively represent a corresponding predictive distribution for $y_i$:
\begin{align}
\hat{p}(y_i\sep x_i) := \norm(y_i; \;\! \nnmu(x_i), \nnLambda(x_i)).
\end{align}
\begin{figure*}
\centering
\includegraphics[width=.9\textwidth]{figures/phase.pdf}
\caption{Visualization of a typical phase diagram in $\rho-\gamma$ regularization space for a heteroskedastic regression model (left). Solid and dotted lines indicate sharp and smooth transitions in model behavior respectively. Example model mean fits shown in red (with pointwise $\pm$ standard deviation in orange) from the FT for each key phase (middle and right). }
\label{fig:cartoonphases}
\end{figure*}
\paragraph{Pitfalls of MLE}
Our assumed form of data naturally suggests training $\nnmu$ and $\nnLambda$, or rather learning $\theta$ and $\phi$, by minimizing the cross-entropy between the joint data distribution $p:=p(x,y)=p(y\sep x)p(x)$ and the induced predictive distribution $\hat{p}:=\hat{p}(y\sep x) p(x)$. This objective is defined as
\begin{align}
 \mathcal{L}&(\theta, \phi) := H(p, \hat{p})  = -\E_{p}\left[\log \hat{p}(x,y)\right] \\
&= -\int_{\mathcal{X}}\!p(x)\!\int_{\mathcal{Y}} \!p(y\sep x) \log \norm(y; \;\! \nnmu(x), \nnLambda(x)) dydx + c, \nonumber
\end{align}
where $c$ is a constant with respect to $\theta$ and $\phi$. This expectation is often approximated using a Monte Carlo (MC) estimate with $N$ samples, yielding the following tractable objective function:
\begin{align}
\mathcal{L}(\theta, \phi) \approx \frac{1}{2N}\sum_{i=1}^N\left[\nnLambda(x_i)\hat{r}(x_i)^2 - \log \nnLambda(x_i)\right], \label{eq:mle}
\end{align}
where $\hat{r}(x_i) = \nnmu(x_i) - y_i$. 
Minimizing this cross-entropy objective function with respect to parameters $\theta$ and $\phi$ using data samples is synonymous with maximum likelihood estimation (MLE). 
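
For concreteness, the form of \cref{eq:mle} can be recovered by writing out the Gaussian log-density in its precision parameterization. For a single pair $(x_i, y_i)$,
\begin{align*}
-\log \norm(y_i; \;\! \nnmu(x_i), \nnLambda(x_i)) = \frac{1}{2}\nnLambda(x_i)\hat{r}(x_i)^2 - \frac{1}{2}\log \nnLambda(x_i) + \frac{1}{2}\log (2\pi),
\end{align*}
so averaging over the $N$ samples and dropping the additive constant $\frac{1}{2}\log(2\pi)$, which does not depend on $\theta$ or $\phi$, gives the MC objective.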

Unfortunately, given an infinitely flexible model, this objective function is ill-posed. Our first observation is that, for any fixed non-zero residual, the parameters $\phi$ admit a well-defined solution even in the absence of any regularization, since
the first term in \cref{eq:mle} is minimized when $\nnLambda \rightarrow 0$, while the second term is minimized when $\nnLambda \rightarrow \infty$. However, the interplay between $\phi$ and $\theta$ leads to divergences in the absence of any regularization on $\theta$. Without such regularization, the mean function $\nnmu$ will estimate $y$ perfectly (or rather to arbitrary precision) for at least a single data point $(x_i, y_i)$. Once this happens, the residual for this input $\nnmu(x_i)-y_i$ approaches zero, and the implicit regularization for $\nnLambda$ vanishes, leading $\nnLambda(x_i)$ to diverge to infinity. Intuitively, the model becomes infinitely (over-)confident in its prediction for this data point. 
Once training has reached this point, the objective function becomes completely unstable due to effectively containing a term whose limit na\"ively yields $\infty - \infty$.\footnote{Note that this is predicated on the model being flexible enough to allow for large changes in predictions $\nnmu(x)$ and $\nnLambda(x)$ after iteratively updating parameters $\theta$ and $\phi$ while allowing for minimal changes in neighboring predictions (i.e., $\nnmu(x')$ and $\nnLambda(x')$ for some $x'\in\mathcal{X}$ such that $0 < ||x-x'|| < \epsilon$).}
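
To make the divergence explicit, consider the contribution of a single data point to \cref{eq:mle}. As a simplifying assumption (in the spirit of the flexibility condition in the footnote), treat the predicted precision $\lambda := \nnLambda(x_i)$ as a free positive scalar and hold the residual $r := \hat{r}(x_i) \neq 0$ fixed; the symbols $\lambda$, $r$, and $\ell$ are introduced here only for illustration:
\begin{align*}
\ell(\lambda) &= \frac{1}{2}\left(\lambda r^2 - \log \lambda\right), \\
\frac{d\ell}{d\lambda} &= \frac{1}{2}\left(r^2 - \frac{1}{\lambda}\right) = 0 \quad\Longrightarrow\quad \lambda^* = \frac{1}{r^2}, \\
\ell(\lambda^*) &= \frac{1}{2}\left(1 + \log r^2\right).
\end{align*}
Since $\ell$ is convex in $\lambda$, $\lambda^*$ is the minimizer. As $r \rightarrow 0$ we have $\lambda^* \rightarrow \infty$ and $\ell(\lambda^*) \rightarrow -\infty$, so the per-point loss is unbounded below once the mean interpolates that point.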

\paragraph{Regularization}
Even though $\nnLambda$ is implicitly regularized in the standard cross-entropy loss as mentioned earlier, we posit that additional regularization on $\nnLambda$, or rather $\phi$, is required to avoid this instability. It can be tempting to think that one must regularize $\theta$ in order to avoid overfitting. And while this is generally true, the objective function $\mathcal{L}$ will still be unstable so long as \emph{at least} one input $x_i$ yields a perfect prediction (i.e., $y_i=\nnmu(x_i)$). This situation is still fairly likely to occur even in the most regularized mean predictors and cannot be avoided, especially if $\{y_i\}$ is zero-centered. 

To prevent this from happening, we can include $L_2$ penalty terms for both $\theta$ and $\phi$ in our loss function:
\begin{align}
\mathcal{L}_{\alpha,\beta}(\theta, \phi) := \mathcal{L}(\theta, \phi) + \alpha||\theta||_2^2 + \beta||\phi||^2_2,
\end{align}
where $\alpha,\beta > 0$ are penalty coefficients. Intuitively, the primary purpose of regularizing $\theta$ is to prevent the mean predictions from overfitting while the goal of regularizing $\phi$ is to provide stability and control complexity in the predicted aleatoric uncertainty. As $\alpha \rightarrow \infty$, the network models a constant mean and, symmetrically, as $\beta \rightarrow \infty$ the network models a constant standard deviation. That is, we effectively arrive at a homoskedastic regime as $\beta \rightarrow \infty$.\footnote{This is under the assumption that either the networks have an unpenalized bias term in the final layer \emph{or} that the data is standardized to have zero mean and unit variance.}


\paragraph{Reparameterized Regularization}
We introduce an alternative parameterization of the regularization coefficients:
\begin{align}
\mathcal{L}_{\rho,\gamma}(\theta, \phi) := \rho\mathcal{L}(\theta, \phi) + \bar{\rho}\left[\gamma||\theta||_2^2 + \bar{\gamma}||\phi||^2_2\right],
\end{align} 
where we restrict $\rho, \gamma \in (0, 1)$ and define $\bar \rho := 1-\rho$ and $\bar \gamma := 1-\gamma$.  
This parameterization is one-to-one with the $\alpha, \beta$ parameterization (with $\alpha = \gamma\bar{\rho}  / \rho$ and $\beta =  \bar{\gamma} \bar{\rho}/ \rho$), and it can be shown that $\nabla_{\theta, \phi} \mathcal{L}_{\rho,\gamma}\propto \nabla_{\theta, \phi} \mathcal{L}_{\alpha, \beta}$, so minimizing one objective is equivalent to minimizing the other. 
Because $\rho$ and $\gamma$ are bounded, we can cover the entire space of regularization combinations by searching over $(0, 1)^2$, whereas in the $\alpha, \beta$ parameterization the coefficients $\alpha, \beta \in \R_{>0}$ are unbounded. 
Now, $\rho$ determines the relative importance between the likelihood and the total regularization imposed on both networks. 
On the other hand, $\gamma$ weights the proportion of total regularization between the mean and precision networks. 
Here, $\rho = 1$ corresponds to the MLE objective while $\rho\to0$ could be interpreted as converging to the mode of the prior in a Bayesian setting. 
Fixing $\gamma=1$ leads to an unregularized precision function while choosing $\gamma=0$ results in an unregularized mean function. 
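
As a short worked check of this correspondence, note that the inverse map below is derived here from the stated forward map rather than quoted from elsewhere. Factoring $\rho$ out of the reparameterized objective and adding the two coefficient definitions gives
\begin{align*}
\mathcal{L}_{\rho,\gamma}(\theta, \phi) &= \rho\left[\mathcal{L}(\theta, \phi) + \frac{\gamma\bar{\rho}}{\rho}||\theta||_2^2 + \frac{\bar{\gamma}\bar{\rho}}{\rho}||\phi||_2^2\right] = \rho\,\mathcal{L}_{\alpha,\beta}(\theta, \phi), \\
\alpha + \beta &= \frac{\bar{\rho}}{\rho} \quad\Longrightarrow\quad \rho = \frac{1}{1 + \alpha + \beta}, \qquad \gamma = \frac{\alpha}{\alpha + \beta}.
\end{align*}
For example, $\alpha = \beta = 1$ corresponds to $\rho = 1/3$ and $\gamma = 1/2$, and substituting back gives $\gamma\bar{\rho}/\rho = \frac{1}{2}\cdot\frac{2/3}{1/3} = 1 = \alpha$.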
\begin{table*} %
\centering
\caption{FT Limiting Cases. We provide intuition for Prop.~\ref{prop:existence} and match the limits to the phase diagram regions in \cref{fig:cartoonphases}.}

\begin{tabular}{ m{2.48cm} m{13.83cm} }
\toprule 
Regularization & Outcome \\
\midrule
$\rho \to 1,\gamma \in [0, 1]$  &  
This is equivalent to MLE. Approaching $\rho=1$, we observe overfit mean solutions (see $O_\text{I}$ and $O_\text{II}$ in \cref{fig:cartoonphases}) across all $\gamma$. 
In theory, at $\rho=1$, there is a contradiction implying no solution should exist. 
\\[0.25cm]
\midrule \\[-0.35cm]
$\rho \to 0, \gamma \in (0, 1)$   &  
The objective is dominated by the regularizers---the data is completely ignored. This corresponds with region $U_\text{I}$. In theory, the optimal solution at $\rho=0$ is for both $\ftmu, \ftp$ to be constant (flat) functions. 
\\[0.25cm]
\midrule \\[-0.35cm]
$\rho \in (0, 1), \gamma \to 1$  &  
All regularization is placed on the mean function, leading to underfit mean. However, the precision is unregularized and the residuals are perfectly matched. This is the top row of the phase diagrams.
\\[0.25cm]
\midrule \\[-0.35cm]
$\rho \in (0, 1), \gamma \to 0$ & 
The mean is unregularized and the precision is strongly regularized. These fits are characterized by severe overfitting and can be found along the bottom row of the phase diagrams. 
\\
\bottomrule
\end{tabular}
\label{tab:cases}
\end{table*}

\paragraph{Qualitative Description of Phases} Model solutions across the space of $\rho$ and $\gamma$ hyperparameters exhibit different traits and behaviors. Similar to physical systems, this can be described as a collection of typical states or \emph{phases} that make up a \emph{phase diagram} as a whole. We find that these phase diagrams are typically consistent in shape across datasets and methodologies. 
\cref{fig:cartoonphases} shows an example phase diagram along with model fits coming from specific $(\rho, \gamma)$ pairings. We argue that there are five primary regions of interest and qualitatively characterize them as follows:
\begin{itemize}[wide, labelwidth=!, labelindent=0pt]
      \setlength\itemsep{0.1em}
    \item Region $U_\text{I}$: Both the mean and precision functions are heavily regularized. The likelihood is weighted so lightly that it is as if the model had not seen the data. Regardless of the $\gamma$-value, the likelihood plays a minor role in the objective. The mean and standard deviation functions are constant at 0 and 1, respectively (the values they were initialized to). 
    \item Region $U_\text{II}$: The mean function is still heavily regularized and tends to be flat, underfitting the data as in Region $U_\text{I}$. 
    Despite the constant mean function, the precision function can still accommodate the residuals.
    \item Region $O_\text{I}$: The mean is heavily overfit and the residuals and corresponding standard deviations essentially vanish.
    Increasing $\rho \to 1$ yields true MLE fits (right side of the phase diagram). This portion of the phase exists across a wide range of $\gamma$-values. Low values of $\gamma$ restrict the flexibility of the precision function, but due to the overfitting in the mean, the flexibility is not needed to fit the residuals. 
    \item Region $O_\text{II}$: The mean function does not overfit due to regularization, leaving large residuals for the lowly regularized precision function to overfit onto. The predicted standard deviation matches each residual perfectly. 
    \item Region $S$: 
    The mean and precision functions adapt to the data without overfitting. We conjecture that solutions in this region will provide the best generalization.
\end{itemize}

\section{Theoretic Considerations}
We proceed to develop a theoretical description of the interplay between regularization strengths and resulting model behavior that captures the limiting behavior of heteroskedastic neural networks in the completely overparameterized regime.
This tool allows us to analytically study edge cases of combinations of regularization strengths and find necessary conditions any pair of optimal mean and standard deviation functions must satisfy, agnostic of any specific model architecture. Furthermore, numerical solutions to our \emph{field theory}, explained below, show good qualitative agreement with practical neural network implementations. 


\begin{figure*}[!htb]
\centering
\includegraphics[width=.8\textwidth]{figures/heatmaps.pdf}
\caption{
Array plot of metrics (rows) across different data or fitting techniques (columns). Leftmost column: results from our field theory (FT); remaining columns: results from fitting neural networks to data (data sets refer to test splits). Averaged results of six runs are shown. Intermediate ticks mark $\gamma=0.5$ and $\rho=0.5$ on the lower-left plot. Our FT aligns qualitatively well with empirical phase diagrams, with consistent phase transitions across models and datasets.
}
\label{fig:summary}
\end{figure*}

\paragraph{Field Theory}
Having discussed the effects of regularization on a heteroskedastic model on a qualitative level, we ask the following questions: \emph{How much do these effects depend on any particular neural network architecture? Can we describe some of these effects on the function level, i.e., without resorting to neural networks?} To address these questions, we will establish \emph{field theories} from statistical mechanics. 




Field theories are statistical descriptions of random functions, rather than discrete or continuous random variables~\citep{altland_condensed_2010}. A \emph{field} is a function assigning scalar values or vectors to spatial coordinates. Examples of physical fields are electric charge densities, water surfaces, or vector fields such as magnetic fields. Low-energy configurations of fields can display recurring patterns (e.g., waves) or undergo phase transitions (e.g., magnetism) upon varying model parameters.  
Since we can think of a function as an infinite-dimensional vector, field theory requires \emph{functional analysis} rather than plain calculus. For example, we frequently seek the field that minimizes a free energy functional, which we obtain by computing a functional derivative and setting it to zero. The advantage of moving to a function-space description is that all details about neural architectures are abstracted away as long as the neural network is sufficiently overparameterized. 

First, we propose abstracting the neural networks $\nnmu$ and $\nnLambda$ with nonparametric, twice-differentiable functions $\ftmu$ and $\ftp$ respectively. Since these functions no longer depend on parameters, we cannot use $L_2$ penalties. A somewhat comparable substitute is to directly penalize the output ``complexity'' of the models, which can be measured via the \emph{Dirichlet energy}: $\alpha\int p(x)||\nabla \ftmu(x)||_2^2dx$ and $\beta\int p(x) ||\nabla \ftp(x)||_2^2dx$. Note that these specific penalizations induce limiting behaviors similar to those of the $L_2$ penalties (i.e., $\alpha,\beta \rightarrow 0$ implies overfitting while $\alpha,\beta \rightarrow \infty$ implies constant functions). In the case where $\nnmu$ and $\nnLambda$ are linear models, this gradient penalty is equivalent to an $L_2$ penalty. Further, networks trained with an $L_2$ weight regularization have empirically been found to have lower \emph{geometric complexity}, a variant of \emph{Dirichlet energy} \citep{dherin_why_2022}. 
We also implement neural networks with \emph{geometric complexity} regularization and present those results in \cref{sec:ffmap-gc}.
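
The linear-model equivalence can be verified in one line. As an illustration, suppose $\ftmu(x) = w^\top x + b$, where the weight vector $w \in \R^d$ and bias $b \in \R$ are symbols introduced here only for this example. Then $\nabla \ftmu(x) = w$ for every $x$, and since $p(x)$ integrates to one,
\begin{align*}
\alpha\int p(x)\,||\nabla \ftmu(x)||_2^2\,dx = \alpha\,||w||_2^2\int p(x)\,dx = \alpha\,||w||_2^2 .
\end{align*}
The Dirichlet energy therefore coincides with an $L_2$ penalty on the weights, while the bias $b$ is left unpenalized.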


Using the assumptions outlined above and the same reparameterization of $(\alpha, \beta)$ to $(\rho, \gamma)$ as with the neural networks, the cross-entropy objective can be interpreted as an action functional of a corresponding two-dimensional FT,
\begin{align}
\mathcal{L}_{\rho, \gamma}(\hat{\mu},\hat{\Lambda}) &= -\int_{\mathcal{X}} p(x)\rho\int_{\mathcal{Y}} p(y\sep x)\log \hat p(y\sep x)\,dy\,dx \label{eq:nfe}
 \\\nonumber &\quad +\int_{\mathcal{X}} p(x)\bar{\rho}\left[\gamma||\nabla\hat{\mu}(x)||_2^2  + \bar{\gamma}||\nabla\hat{\Lambda}(x)||_2^2 \right] dx,
\end{align}
where $\hat p(y\sep x)=\mathcal{N}(y\sep\hat{\mu}(x), \hat{\Lambda}(x))$. This description assumes a continuous data density $p(x)$, a continuous distribution over regression noise $p(y\sep x)$, and continuous functions $\hat{\mu}(x)$ and $\hat{\Lambda}(x)$ whose behavior we would like to study as a function of varying the regularizers $\rho$ and $\gamma$. 


One can view the indexed set $y(\cdot) = \{y(x)\}_{x \in \mathcal{X}}$ as a stochastic process (specifically a white noise process scaled by the true standard deviation $\Lambda(x)^{-1/2}$ and shifted by the true mean $\mu(x)$).
We are interested in the statistical properties of the field theory for any given realization of this stochastic process, $y(x)$, and ideally, we would average over multiple draws. However, for computational convenience, we restrict our attention to a single sample. This simplification is equivalent to considering a specific dataset and similar in spirit to fitting a heteroskedastic model to real data. 
This approximation yields the following simplified FT,
\begin{align}
\mathcal{L}_{\rho, \gamma}(\hat{\mu},\hat{\Lambda}) &\approx \int_{\mathcal{X}}p(x) \rho\bigg[\frac{1}{2}\hat{\Lambda}(x)\hat{r}(x)^2  -\frac{1}{2}\log \hat{\Lambda}(x)\bigg] \\\nonumber
&\quad +p(x)\bar{\rho}\left[\gamma||\nabla\hat{\mu}(x)||_2^2 + \bar{\gamma}||\nabla\hat{\Lambda}(x)||_2^2 \right]dx,
\end{align}
where $\hat r(x) := \hat \mu (x) - y(x)$. 
We are primarily interested in solutions $\ftmu^*$ and $\ftp^*$ that minimize the FT $\mathcal{L}_{\rho, \gamma}(\ftmu, \ftp)$ as these are roughly analogous to models $\nnmu$ and $\nnLambda$ that minimize penalized cross-entropy $\mathcal{L}_{\rho, \gamma}(\theta, \phi)$. We can gain insights into these solutions by taking functional derivatives of the FT with respect to $\hat{\mu}$ and $\hat{\Lambda}$ and setting them to zero. 

This procedure yields stationary conditions in the form of \emph{partial differential equations} for $\hat \mu^*$ and $\hat \Lambda^*$:
\begin{align}
\ftp^*(x)\hat{r}^*(x) &= 
2\frac{\bar{\rho}}{\rho} \gamma\frac{\Delta\ftmu^*(x)}{p(x)}\nonumber \\
\text{ and } \quad \hat{r}^*(x)^2 &= 
\frac{1}{\ftp^*(x)}+ 4\frac{\bar{\rho}}{\rho}\bar{\gamma}\frac{\Delta\ftp^*(x)}{p(x)},
\label{eq:pdes}
\end{align}
where $\hat{r}^*(x) = \ftmu^*(x) - y(x)$ and $\Delta$ is the Laplace operator \citep{engel_density_2011}. Note that these equalities hold true \emph{almost everywhere} (a.e.) with respect to $p(x)$. 

Interestingly, both resulting relationships include a regularization coefficient divided by the density of $x$. Intuitively, while the functions as a whole have a global level of regularization dictated by $\rho$ and $\gamma$, locally this regularization strength is scaled inversely to how likely the input is. This means that areas of high density in $x$ permit more complexity, while less likely regions are constrained to produce simpler outputs. Similarly, since $\Delta \ftmu$ and $\Delta \ftp$ measure the \emph{curvature} of these functions, we see that $\rho$ and $\gamma$ directly impact the complexity of $\ftmu$ and $\ftp$, as we expect. 



\paragraph{Numerically Solving the FT}
Since the stationary conditions in~\cref{eq:pdes} are too complex to be solved analytically, we discretize and minimize the FT to arrive at approximate solutions; in theory, we can do so to arbitrary precision. 
Let $\{x_i\}_{i=1}^{N_{D}}$ be a set of $N_{D}$ fixed points in $\mathcal{X}$ that we assume are evenly spaced. 
Define $\ftmud, \ftpd, \vec y$ to be $N_{D}$-dimensional vectors where for each $i$, $\ftmud_i := \ftmu(x_i), \ftpd_i := \ftp(x_i), y_i := y(x_i)$. 
We solve for the optimal $\ftmud$ and $\ftpd$ using the discretized approximation to \cref{eq:nfe} via gradient-based optimization methods:
\begin{align}
\mathcal{L}_{\rho, \gamma}(\ftmud,\ftpd) &\approx \sum_{i=1}^{N_{D}} \rho\left[\frac{1}{2}\ftpd_i\left(y_i-\ftmud_i\right)^2  -\frac{1}{2}\log \ftpd_i\right] \nonumber\\
&\qquad+\bar{\rho}\left[\gamma||\nabla\ftmud_i||_2^2 + \bar{\gamma}||\nabla\ftpd_i||_2^2 \right],
\end{align}
and numerically approximate the gradients of $\ftmu, \ftp$ by finite-difference methods \citep{fornberg_generation_1988}.
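
As one illustrative instance of such a scheme (an assumption made here for concreteness; the stencil actually used may differ), consider a one-dimensional grid with spacing $h := x_{i+1} - x_i$. A first-order forward difference then approximates the penalty terms as
\begin{align*}
||\nabla\ftmud_i||_2^2 \approx \left(\frac{\ftmud_{i+1} - \ftmud_i}{h}\right)^2, \qquad
||\nabla\ftpd_i||_2^2 \approx \left(\frac{\ftpd_{i+1} - \ftpd_i}{h}\right)^2, \qquad i = 1, \dots, N_{D}-1,
\end{align*}
with the boundary point $i = N_{D}$ handled by a backward difference or omitted from the penalty. Under this choice the discretized objective is a differentiable function of the $2N_{D}$ entries of $\ftmud$ and $\ftpd$. The constraint $\ftpd_i > 0$ could be maintained, for instance, by optimizing over $\log \ftpd_i$, which is again an illustrative choice.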

\paragraph{FT Insights}
The pair of constraints in \cref{eq:pdes} allows us to glean useful insights into the resulting regularized solutions by looking at edge cases of specific combinations of $\rho$ and $\gamma$ values.
We summarize the theoretical properties of the limiting cases of $\rho$ and $\gamma$ approaching extreme values in the proposition below and in \cref{tab:cases}. Please refer to \cref{sec:app_proofs} for the proofs of these claims.

\begin{prop}\label{prop:existence}
Under the assumptions of our FT (see above), the following properties hold: (i) in the absence of regularization ($\rho = 1$), there are no solutions to the FT; (ii) in the absence of data ($\rho = 0$), there is no unique solution to the FT; and (iii) in order for there to exist a solution to the FT there must be regularization on the mean function.
\end{prop}
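
For intuition regarding (i), and without reproducing the proofs in \cref{sec:app_proofs}, one can substitute $\rho = 1$ (so $\bar{\rho} = 0$) into \cref{eq:pdes}, which removes both Laplacian terms:
\begin{align*}
\ftp^*(x)\hat{r}^*(x) = 0 \qquad \text{and} \qquad \hat{r}^*(x)^2 = \frac{1}{\ftp^*(x)} .
\end{align*}
Because $\ftp^*(x) > 0$, the first condition forces $\hat{r}^*(x) = 0$, while the second then requires $1/\ftp^*(x) = 0$, which no finite positive precision satisfies. Similarly, taking $\gamma \rightarrow 1$ (so $\bar{\gamma} \rightarrow 0$) with $\rho \in (0,1)$ reduces the second condition to $\ftp^*(x) = 1/\hat{r}^*(x)^2$, so the predicted variance equals the squared residual pointwise. This is the sense in which the residuals are ``perfectly matched'' in \cref{tab:cases}.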

These limiting cases match the intuition conveyed earlier, which also applies in the neural network context. Furthermore, if we assume that there do exist valid solutions for $\gamma,\rho \in (0, 1)$, it follows that the solutions should undergo either sharp transitions or smooth cross-overs between the behaviors described in the limiting cases when varying the regularization strengths. Section~\ref{sec:experiments} shows that, empirically, these phase diagrams resemble \cref{fig:cartoonphases}. We leave the analytical justification for the types of boundaries and their shapes and placement in the phase diagram for future work.



\begin{table*}[htp!]
\centering
\small
\caption{Comparison of a deep heteroskedastic regression model with diagonal regularization search with $\beta$-NLL \citep{seitzer_pitfalls_2022} and two conformal prediction implementations. For details on the selection criteria of the heteroskedastic model see \cref{sec:diag-crit}. The final two columns are comparisons against models selected in the same way as in our suggestion, but trained on half of the data in a split conformal fashion. The third column has uniform bandwidth (homoskedastic) assumptions while the fourth column has a locally adaptive \citep{lei_distribution-free_2018} (heteroskedastic) bandwidth. Possibly due to the reduced training size, performance suffers. In a conformal setting the ``standard deviation'' does not have an obvious analogue. We calibrate the bandwidths to the 0.682 quantile because $\pm$ 1 standard deviation covers $\approx68.2\%$ of a standard normal distribution. The lowest mean value for each quantity is bolded. We report the average and standard deviation of $\mu$- and $\Lambda^{-\frac{1}{2}}$-MSE across six runs on test data.}


\begin{tabular}{ lr| *{4}{c} }
\toprule
{Dataset}&
{Metric} &
{Heteroskedastic} &
{$\beta$-NLL}&
{Conformal}&
{Conformal (local)} \\
\midrule
Sine &$\mu$ MSE   & 0.80 ± 0.00 & 0.69 ± 0.05 & \textbf{0.54} ± 0.09 & 0.82 ± 0.00\\
& $\Lambda^{-\frac{1}{2}}$MSE   & 0.80 ± 0.00 & 0.52 ± 0.07 & 0.36 ± 0.13 &  \textbf{0.33} ± 0.00\\
\midrule
Concrete & $\mu$ MSE  & \textbf{0.11} ± 0.02 & 0.55 ± 0.30  & 0.27 ± 0.01 & 0.80 ± 0.00  \\
& $\Lambda^{-\frac{1}{2}}$MSE  & 0.30 ± 0.51 & 1.09 ± 0.20 & \textbf{0.09} ± 0.00  & 0.43 ± 0.00 \\
Housing & $\mu$ MSE   & 1.22 ± 0.00 & 0.32 ± 0.05 & \textbf{0.31} ± 0.00 & \textbf{0.31} ± 0.00\\
& $\Lambda^{-\frac{1}{2}}$MSE   & 0.76 ± 0.00 & 0.88 ± 0.03   & \textbf{0.13} ± 0.00 & 0.14 ± 0.00  \\
Power & $\mu$ MSE    & \textbf{0.04} ± 0.01 & 0.09 ± 0.01 & 0.18 ± 0.00 & 0.19 ± 0.00    \\
& $\Lambda^{-\frac{1}{2}}$MSE    & \textbf{0.03} ± 0.01 & 0.31 ± 0.37 & \textbf{0.03} ± 0.00 & \textbf{0.03} ± 0.00 \\
Yacht & $\mu$ MSE    & \textbf{0.01} ± 0.01 & \textbf{0.01} ± 0.01 & 0.04 ± 0.00 & 0.84 ± 0.00  \\
& $\Lambda^{-\frac{1}{2}}$MSE    & \textbf{0.01} ± 0.01 & 1.33 ± 0.02 & \textbf{0.01} ± 0.00 & 0.49 ± 0.00  \\
\midrule
Solar Flux& $\mu$ MSE  & 0.29 ± 0.00 & 0.38 ± 0.00  & \textbf{0.05} ± 0.00 & 0.33 ± 0.00  \\
& $\Lambda^{-\frac{1}{2}}$MSE  & 0.12 ± 0.00 & 0.32 ± 0.00 & \textbf{0.01} ± 0.00 & 0.37 ± 0.01\hspace{0.85em}\\
\bottomrule
\end{tabular}
\label{tab:baseline_sub}
\end{table*}