\section{Experiments}
\label{sec:experiments}
The main focus of our experiments is to visualize the phase transitions in two-dimensional phase diagrams and identify summary statistics (``observables'') that display them. 
We establish that these properties are independent of any particular neural network architecture by showing qualitative agreement with the field theory. 
Finally, through this exploratory analysis we discovered a method for finding well-suited combinations of $(\rho, \gamma)$-regularization strengths that reduces a two-dimensional hyperparameter search to a one-dimensional one, allowing for the efficient identification of heteroskedastic model fits that neither over- nor underfit. 

\paragraph{Modeling Choices} 
We chose $\nnmu, \nnLambda$ to be fully-connected networks with three hidden layers of 128 nodes and leaky ReLU activation functions.  
The first half of training was only spent on fitting $\nnmu$, while in the second half of training, both $\nnmu$ and $\nnLambda$ were jointly learned. 
This improves stability, since the precision is dependent on the mean $\nnmu$, and is similar in spirit to ideas presented in \citet{skafte_reliable_2019}. 
Complete training details can be found in \cref{sec:training}. 

\paragraph{Datasets}
We analyze the effects of regularization on several one-dimensional simulated datasets, standardized versions of the \emph{Concrete} \citep{yeh_i-cheng_concrete_2007}, \emph{Housing} \citep{harrison_hedonic_1978}, \emph{Power} \citep{tufekci_prediction_2014}, and \emph{Yacht} \citep{gerritsma_geometry_1981} regression datasets from the UC Irvine Machine Learning Repository~\citep{kelly_uci_nodate}, and a scalar quantity from the ClimSim dataset \citep{yu_climsim_2023}. 
We fit neural networks to the simulated and real-world data and additionally solve our FT for the simulated data. 
Detailed descriptions of the data are included in \cref{sec:datasets}. 
We present the results for a simulated sinusoidal (\emph{Sine}) dataset as well as the four UCI regression datasets and have results for the other simulated datasets in \cref{sec:nn-sim}. 


\subsection{Qualitative Analysis}

Our qualitative analysis aims at understanding architecture-independent aspects of heteroskedastic regression upon varying the regularization strength on the mean and variance functions,  resulting in the observation of phase transitions. 

\paragraph{Metrics of Interest}
We are interested in how well-calibrated the resulting models are as well as how expressive the learned functions are. 
We compute two types of metrics on our experiments to summarize these properties.
Firstly, we consider the mean squared error (MSE). 
We measure this quantity between predicted mean $\nnmu(x_i)$ and target $y_i$, as well as between predicted standard deviation ($\Lambda^{-1/2}(x_i)$) and absolute residual $|\nnmu(x_i)-y_i|$.
If the mean and standard deviation are well-fit to the data, both of these values should be low. 
We opt for $\Lambda^{-\frac{1}{2}}$ MSE due to its similarities to variance calibration \citep{skafte_reliable_2019} and expected normalized calibration error \citep{levi_evaluating_2022}. 
Secondly, for the learned $\nnmu, \nnLambda, \ftmud, \ftpd$, we evaluate the Dirichlet energy for the FT and its discrete analogue, geometric complexity \citep{dherin_why_2022}, for neural networks. As previously mentioned, the Dirichlet energy of a function $f$ is defined as $\int_{\mathcal{X}} p(x) ||\nabla f(x)||_2^2 \, dx$. Meanwhile, geometric complexity is $N^{-1} \sum_{i=1}^N ||\nabla f(x_i)||_2^2$. 
Each quantity captures how expressive a learned function is, with more expressive functions yielding higher values, and is analogous (or equivalent) to the quantity we penalize in the FT setting.
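As a sketch of the two MSE metrics described above, with $N$ denoting the number of evaluation points $(x_i, y_i)$ and $\mathrm{MSE}_{\mu}$, $\mathrm{MSE}_{\Lambda^{-1/2}}$ being shorthand introduced here for illustration (the unweighted average over points is an assumption), one may write
\[
\mathrm{MSE}_{\mu} = \frac{1}{N} \sum_{i=1}^N \big(\nnmu(x_i) - y_i\big)^2,
\qquad
\mathrm{MSE}_{\Lambda^{-1/2}} = \frac{1}{N} \sum_{i=1}^N \Big(\nnLambda(x_i)^{-1/2} - |\nnmu(x_i) - y_i|\Big)^2 .
\]
Likewise, if the inputs $x_i$ are assumed to be drawn i.i.d.\ from $p(x)$, geometric complexity can be read as a Monte Carlo estimate of the Dirichlet energy:
\[
\frac{1}{N} \sum_{i=1}^N ||\nabla f(x_i)||_2^2 \;\approx\; \int_{\mathcal{X}} p(x) \, ||\nabla f(x)||_2^2 \, dx .
\]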


\paragraph{Plot Interpretation}
We present summaries of the fitted models in grids with $\rho$ on the $x$-axis and $\gamma$ on the $y$-axis in \cref{fig:summary}. 
The far right column ($\rho=1$) corresponds to MLE solutions. 
The main focus is on qualitative traits of fits under different levels of regularization and how they behave in a relative sense, rather than on absolute values. 
\cref{fig:diag_slice_a} shows the summary statistics along the slice where $\rho = 1-\gamma$. 
Zero on these plots corresponds to the upper left corner while one corresponds to the lower right corner. 
We provide model fits arranged in grids of the same orientation for the field theory and neural networks on the \emph{Sine} dataset in \cref{fig:ft-fits,fig:mlp-fits}.

\textbf{Observation 1:}
\emph{Our metrics show sharp phase transitions upon varying $\rho, \gamma$, as in a physical system.}

\cref{fig:summary} and \cref{fig:diag_slice_a} show sharp transitions when moving along the minor diagonal, some leading to improving and others to worsening performance. 
Taken together, the metrics reveal all five regions, but not every region in \cref{fig:cartoonphases} appears in the heatmap of each individual metric. 
For example, region $O_\text{II}$ does not always appear in the metrics related to the mean.  
When using neural networks to approximate $\mu$ and $\Lambda$, there are sharper boundaries between phases than in the FT's numerical solutions. 
The boundary between $U_\text{II}$ and $O_\text{I}$ is sharply observed in the plots of $\int ||\nabla \mu(x)||_2^2 \, dx$. 
However, in terms of $\mu$ MSE, a smoother transition (i.e., region $S$) is visible. 


\textbf{Observation 2:}
\emph{The FT insights and observed phases are consistent with the numerically solved FT and the results from fitting neural networks. Thus, our results are not tied to a specific architecture or dataset.}

In alignment with our theoretical insights, phases $U_\text{I}$ and $O_\text{I}$ exhibit consistent behavior across $\gamma$-values (vertical slices in \cref{fig:summary}). 
Qualitatively, we find the same types of phase diagrams and phase transitions across all considered datasets.
Empirically, we observe that boundaries between regions of interest are similar in shape across datasets but not quantitatively the same, i.e., phase transitions occur at differing levels of regularization for different datasets of differing dimensionality. 

In the right-hand columns $(\rho \to 1)$, the mean function matches the data near-perfectly; this is also visible in the lower rows $(\gamma \to 0)$. 
Within the metrics we assess, the shapes of the regions vary with regularization strength in a similar fashion on all datasets. 
In the plots of $\int ||\nabla \Lambda(x)||_2^2 \, dx$, the region where $\Lambda$ is flatter covers a larger area compared to the phase diagram showing $\int ||\nabla \mu(x)||_2^2 \, dx$. 
That is, for the same proportion of regularization as the mean, the precision remains flatter. 



\subsection{Quantitative Analysis}
Our quantitative analysis aims to demonstrate the practical implications of our qualitative investigations that result in better calibration properties.

\textbf{Observation 3:}
\emph{We can search along $\rho = 1-\gamma$ to find a well-calibrated $(\rho, \gamma)$-pair from region $S$.}

Our FT indicates that a slice across the minor diagonal of the phase diagram should always cross the $S$ region (see \cref{fig:cartoonphases}).
\cref{fig:diag_slice_a} shows that by searching along this diagonal, we indeed find a combination of regularization strengths where both $\nnmu$ and $\nnLambda$ generalize well to held-out test data. 
This implies that there is no need to search the full two-dimensional space; a single slice suffices, which reduces the number of models to fit from $O(N^2)$ to $O(N)$, where $N$ is the number of values tested for each of $\rho$ and $\gamma$.  
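To make the savings concrete, consider a purely illustrative grid in which each of $\rho$ and $\gamma$ takes $20$ candidate values (this number is an example chosen here, not a setting taken from our experiments):
\[
\underbrace{20^2 = 400}_{\text{models on the full grid}}
\qquad \text{versus} \qquad
\underbrace{20}_{\text{models on the diagonal } \rho = 1-\gamma},
\]
that is, a twentyfold reduction in the number of models to fit.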

\cref{fig:diag_slice_a} shows that along the minor diagonal the performance is initially poor, improves, and then drops off again. 
These shifts from strong to weak performance are sharp. 
The regularization pairings that result in optimal performance with respect to $\mu$- and $\Lambda^{-1/2}$-MSE are near each other along this diagonal for the real-world test data. 
As the theory predicts, the performance becomes highly variable as we approach the MLE solutions and the FT fails to converge in this region. 
In practice, we propose searching along this line to find the $(\rho, \gamma)$-combinations that minimize the $\mu$- and $\Lambda^{-\frac{1}{2}}$-MSEs and averaging the regularization strengths to fit a model.  
We compare models chosen by our diagonal line search to two heteroskedastic modeling baselines in \cref{sec:baseline_comp} on the synthetic and UCI datasets as well as a scalar quantity from the ClimSim dataset \citep{yu_climsim_2023}. We present a subset of the results in \cref{tab:baseline_sub}. In most cases, the model chosen via the diagonal line search was competitive with or better than the baselines.
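
The diagonal line search proposed above can be sketched as follows. 
The notation is introduced here for illustration only: $\mathcal{R}$ denotes the set of candidate $\rho$ values, and $\mathrm{MSE}_{\mu}(\rho, \gamma)$ and $\mathrm{MSE}_{\Lambda^{-1/2}}(\rho, \gamma)$ denote the $\mu$- and $\Lambda^{-1/2}$-MSE of the model fit with strengths $(\rho, \gamma)$. 
We further assume that averaging means an arithmetic mean; averaging on another scale, such as the logit scale used in \cref{fig:diag_slice_a}, would be an equally plausible reading.
\[
\rho_{\mu}^{*} = \operatorname*{arg\,min}_{\rho \in \mathcal{R}} \mathrm{MSE}_{\mu}(\rho, 1-\rho),
\qquad
\rho_{\Lambda}^{*} = \operatorname*{arg\,min}_{\rho \in \mathcal{R}} \mathrm{MSE}_{\Lambda^{-1/2}}(\rho, 1-\rho),
\]
\[
\hat{\rho} = \frac{\rho_{\mu}^{*} + \rho_{\Lambda}^{*}}{2},
\qquad
\hat{\gamma} = 1 - \hat{\rho}.
\]
Under the arithmetic-mean assumption, the selected pair $(\hat{\rho}, \hat{\gamma})$ remains on the diagonal, since averaging the corresponding $\gamma$ values $1-\rho_{\mu}^{*}$ and $1-\rho_{\Lambda}^{*}$ gives exactly $1-\hat{\rho}$.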







\begin{figure*}[!htb]
\centering
\begin{subfigure}[b]{0.87\textwidth}
   \includegraphics[width=.9\linewidth]{figures/diag_slice_a_fixed.pdf}
\end{subfigure}
\begin{subfigure}[b]{0.87\textwidth}
   \includegraphics[width=.9\linewidth]{figures/diag_slice_b.pdf}
\end{subfigure}
\caption{Test metrics for six runs achieved along the $\rho=1-\gamma$ minor diagonal. 
Stars indicate minimum MSE values. 
All metrics are reported on a $\log_{10}$ scale. 
$\rho$ values are shown on a logit scale with $\overline{10^k}:=1-10^k$. 
From left to right, note the sharp decrease in test metric values, especially for the neural network models, followed by a typically smoother increase. 
This empirically supports the existence of the well-calibrated $S$ phase shown in \cref{fig:cartoonphases}.
}
\label{fig:diag_slice_a}
\end{figure*}
