\documentclass[accepted]{uai2024} % for initial submission
%\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\input{math_commands.tex}

\usepackage{url}
%\usepackage{cleveref}
\usepackage{textcomp}
\usepackage{amssymb}
\usepackage[linesnumbered,ruled,vlined]{algorithm2e}
\usepackage{relsize}
\usepackage{algpseudocode}
\usepackage{float}
\usepackage{multirow}
\usepackage{caption}
\usepackage{enumitem}
\usepackage{makecell}
\usepackage{tablefootnote}

%\usepackage[english]{babel}

\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{hyperref}       % hyperlinks
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{microtype}      % microtypography
\usepackage{xcolor}         % colors

\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{array}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{xy}

\usepackage{times}
\usepackage{subcaption}
\usepackage{comment}
\usepackage{bbm}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
\newtheorem{example}{Example}
\newcommand\thickbar[1]{\accentset{\rule{.5em}{0.6pt}}{#1}}

\newtheorem*{proposition1}{Proposition~\ref{prop:basic}}
\newtheorem*{proposition2}{Proposition~\ref{thm:rlp_loss_convex}}
\newtheorem*{proposition3}{Proposition~\ref{thm:better_loss}}


 \newcommand{\ind}{\perp\!\!\!\!\perp} 
\newcommand{\A}[1]{{\color{blue}[\textbf{Ahmed:}#1]}}
\newcommand{\SH}[1]{{\color{red}[\textbf{Shyam: }#1]}}
\newcommand\eqnum{\addtocounter{equation}{1}\tag{\theequation}}
\def\de{\overset{\Delta}{=}}

\title{Random Linear Projections Loss for Hyperplane-Based \\ Optimization in Neural Networks}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Shyam Venkatasubramanian\textsuperscript{*}}
\author[1]{Ahmed Aloui\textsuperscript{*}}
\author[1]{Vahid Tarokh}
% Add affiliations after the authors
\affil[1]{
    Department of Electrical and Computer Engineering, Duke University
}

\begin{document}
\maketitle
\let\thefootnote\relax\footnotetext{$^*$Equal contribution. GitHub: \href{https://github.com/AhmedAloui1997/RandomLinearProjections}{https://tinyurl.com/34bm8r9c}.}

\begin{abstract}
Advancing loss function design is pivotal for optimizing neural network training and performance. This work introduces Random Linear Projections (RLP) loss, a novel approach that enhances training efficiency by leveraging geometric relationships within the data. Distinct from traditional loss functions that target minimizing pointwise errors, RLP loss operates by minimizing the distance between sets of hyperplanes connecting fixed-size subsets of feature-prediction pairs and feature-label pairs. Our empirical evaluations, conducted across benchmark datasets and synthetic examples, demonstrate that neural networks trained with RLP loss outperform those trained with traditional loss functions, achieving improved performance with fewer data samples, and exhibiting greater robustness to additive noise. We provide theoretical analysis supporting our empirical findings.
\end{abstract}

% \A{New abstract structure: focus on presenting the new loss instead of discrediting MSE. Start with a general statement about why we need loss functions, and talk about point-wise deviations vs learning the shape ...}

\section{Introduction}
\label{intro}

Deep neural networks have achieved success across various applications, including computer vision~\citep{lecun1995convolutional,krizhevsky2012imagenet,minaee2021image}, natural language processing~\citep{hochreiter1997long,vaswani2017attention,radford2018improving}, generative modeling~\citep{goodfellow2020generative,kingma2019introduction,song2020score}, and reinforcement learning~\citep{mnih2013playing,van2016deep,haarnoja2018soft}. Foundational to these fields are the tasks of regression and classification, in which neural networks have been empirically shown to outperform conventional techniques~\citep{reddy2012image}. Training neural networks relies on the principle of empirical risk minimization (ERM)~\citep{vapnik1993local}, which minimizes the average loss on observed data as a proxy for the generalization error. The effectiveness of ERM therefore depends on the choice of loss function, which affects both how quickly training converges and how well the trained model generalizes.

Among the most popular loss functions used to train neural networks are Mean Squared Error (MSE) and Cross Entropy, tailored to regression and classification tasks, respectively. MSE measures the average squared differences between the observed values (labels) and model outcomes (predictions) while Cross Entropy assesses the divergence between class labels and predicted probabilities --- both MSE and Cross Entropy are measures of local pointwise deviation, as they compare individual predictions with their labels. Neural networks trained with these loss functions have achieved state-of-the-art performance across benchmark datasets for regression and classification (e.g., California Housing~\citep{geron2022hands} and MNIST~\citep{deng2012mnist}).

Despite achieving state-of-the-art performance on benchmark datasets, neural networks trained with MSE and Cross Entropy also face significant challenges. Empirical evidence suggests that these models can be slow to converge to optimal solutions, which affects training efficiency~\citep{livni2014computational,bartlett2002hardness,blum1988training}. Additionally, their performance can be limited when overparameterized~\citep{aggarwal2018neural}, and the presence of additive noise may result in unstable behavior and variable predictions~\citep{Feng2020CanCE}. These issues highlight the limitations of neural networks optimized with pointwise loss functions such as MSE and Cross Entropy and motivate the search for more effective training methodologies.

In the deep learning literature, several methods have been proposed to address the aforementioned challenges. In particular, it is commonplace to regularize the weights of the neural network (e.g., $L_2$ regularization~\citep{krogh1991simple}). However, these regularization approaches usually assume the existence of a prior distribution over the model weights. Another approach is to modify the gradient descent optimization procedure itself. For example, SGD~\citep{rumelhart1986learning}, SGD with Nesterov momentum~\citep{nesterov1983method}, Adam~\citep{kingma2014adam}, AdamW~\citep{loshchilov2017decoupled}, and Adagrad~\citep{duchi2011adaptive} are optimizer variations of this kind. Alternatively, rather than altering the neural network training procedure, data preprocessing methods, especially data augmentation techniques, have proven successful in computer vision, speech recognition, and natural language processing applications~\citep{van2001,chawla2002smote,han2005borderline,jiang2020meshcut, chen2020gridmask, feng2021survey}. Among these data augmentation strategies, mixup~\citep{zhang2017mixup} has been proposed as a means of mitigating the vulnerabilities discussed above.
% \A{this feels like an attack on the traditional methods. I think we can just remove this sentence and present our work as a nice new addition to what already exists, especially as we don't compare our work to all possible conceivable ways of optimization with MSE (nesterov, newton, other data augmentation methods in computer vision, NLP, etc. }

% paragraph5: Provide intuition without specifying RLP. Discuss underlying structure and whether we can learn the function describing the feature-label pairs using the feature-prediction pairs. Provide illustrative example (if two functions share the same feature-label pairs, then they must necessarily be equivalent).

Recalling that MSE and Cross Entropy are measures of local pointwise deviation, we seek to answer a fundamental question: does the consideration of non-local properties of the training data help neural networks achieve better generalization? As depicted in Figure~\ref{fig:rlp_illustration}, if two functions share the same hyperplanes connecting all subsets of their feature-label pairs, then they must necessarily be equivalent. Extending this observation to deep learning, if the distance between the sets of hyperplanes connecting fixed-size subsets (batches) of the neural network's feature-prediction pairs and feature-label pairs approaches zero, then the function represented by the neural network converges to the true function (the true mapping between the features and labels). A loss function that incorporates this intuition would capture non-local properties of the training data, addressing some of the limitations of the traditional training approach.

In this light, we introduce \textit{Random Linear Projections (RLP) loss}: a hyperplane-based loss function that captures non-local linear properties of the training data to improve model generalization. More concretely, we consider a simple example to illustrate RLP loss. Suppose we have a training dataset consisting of $d$-dimensional features and real-valued outcomes. To train a given neural network with RLP loss, we first obtain as many fixed-size ($M \geqslant d+1$) subsets of feature-label pairs as possible. For each such subset, we obtain a corresponding subset of feature-prediction pairs, where the predictions are the outcomes of the neural network. Subsequently, we learn the corresponding regression matrices~\citep{van1987new}, and we minimize the distance between the hyperplanes associated with these matrices. We note that this method does not assume the true function is linear, as the large number of fixed-size subsets of feature-label pairs (random linear projections) encourages the neural network to capture potential nonlinearities.

%intuition figure
\begin{figure}
    \centering
    \includegraphics[width=7.7cm]{figures/figure_0.pdf}
    \caption{Comparing true and predicted functions: illustration that two functions are equivalent if and only if they share identical hyperplanes generated by all possible feature-label pairs.}
    \label{fig:rlp_illustration}
\end{figure}

The outline of this paper is as follows. In Section~\ref{SecTheory}, we mathematically formalize RLP loss and prove relevant properties. In Section~\ref{SecAlgorithm}, we delineate the algorithm for generating fixed-size subsets of feature-label pairs from the training data. In Section~\ref{SecEmpiricalResults}, we provide empirical results demonstrating that neural networks trained with RLP loss achieve superior performance when compared to those trained with MSE loss and Cross Entropy loss. Finally, in Section~\ref{SecConclusion}, we summarize our work. Our contributions are summarized below:

\begin{enumerate}
    \item We introduce \textit{Random Linear Projections (RLP) loss}, a new loss function that leverages geometric relationships to capture non-local linear properties.
    \item We prove that neural networks trained with \textit{RLP loss} learn the optimal function when the loss is minimized, and that they converge faster than those trained with MSE loss when certain properties hold.
    \item We propose an algorithmic procedure to generate fixed-size subsets of feature-label pairs that are necessary for training neural networks with \textit{RLP loss}.
    \item We demonstrate that neural networks trained with \textit{RLP loss} achieve better performance and converge faster than those trained with MSE and Cross Entropy loss.
    
\end{enumerate}


\paragraph{Related work.} 
There are two primary methods for enhancing the performance of neural networks trained with MSE loss and Cross Entropy loss. On one hand, incorporating regularization during training is a prevalent approach~\citep{wang2020regularization,zhang2018three}. For instance, in $L_2$ regularization~\citep{krogh1991simple}, the loss function is altered to incorporate the weighted $L_2$ norm of the weights during optimization. This discourages excessively large weights, thereby mitigating overfitting. Other proposed regularization techniques include $L_1$ regularization~\citep{tibshirani1996regression,lv2009unified} and adaptive weight decay~\citep{nakamura2019adaptive}. On the other hand, data augmentation techniques, such as mixup~\citep{zhang2017mixup,zhang2020does}, go beyond empirical risk minimization and have demonstrated increased robustness against noise and adversarial attacks. Specifically, mixup trains a neural network on convex combinations of pairs of examples and their corresponding labels. In our study, we take a different direction by changing the loss function itself: we minimize the distance between sets of hyperplanes that connect fixed-size subsets of the neural network's feature-prediction pairs and feature-label pairs. While it is conceivable to integrate both regularization and data augmentation methods into our proposed loss function, we reserve that exploration for future research.


\section{Theoretical Results} 
\label{SecTheory}
Let $\{(X_i,Y_i)\}_{i=1}^M$ denote a set of independent and identically distributed (i.i.d.) random variables, where $X_i \in \mathbb{R}^d$ is the feature vector with dimension $d$, $Y_i \in \mathbb{R}$ is the corresponding label, and $M$ is the number of considered random variables (assumed to be strictly greater than $d$). Now, let $\mathbf{X}$ denote the matrix in $\mathcal{M}_{M,d}(\mathbb{R})$ such that the $i^{\text{th}}$ row of the matrix corresponds to the vector $X_i$. Similarly, let $\mathbf{Y}$ be the vector in $\mathbb{R}^M$ such that its $i^{\text{th}}$ element corresponds to $Y_i$.

Furthermore, we define $\mathcal{H} \subset \{h: \mathbb{R}^d \to \mathbb{R}\}$ as the class of hypothesis functions that model the relationship between $X_i$ and $Y_i$. In our empirical setting, we let $\mathcal{H}$ denote the set of neural networks that have predetermined architectures. We then define $\mathbf{h}: \mathcal{M}_{M,d}(\mathbb{R}) \to \mathbb{R}^M$, $\mathbf{X} \mapsto \left(h(X_1),\ldots,h(X_M)\right)^{\top}$, as the extension of the hypothesis, $h$, to the space of matrices, $\mathcal{M}_{M,d}(\mathbb{R})$, obtained by applying $h$ to each row.

We begin by defining the MSE loss function, the standard measure for regression tasks, and subsequently introduce our proposed Random Linear Projections (RLP) loss. 
\begin{definition}[MSE Loss]
The MSE loss function is defined as,
    $$L_0(h) = \mathbb{E}\left[\|h(X)-Y\|^2\right]$$
where $(X,Y)$ and $\{(X_i,Y_i)\}_{i=1}^M$ are i.i.d.\ random variables.
\end{definition}
\begin{definition}[Random Linear Projections Loss]
\label{def:rlp}
The RLP loss function is defined as,
$$\mathcal{L}(h) = \mathbb{E}\Biggl[\biggl\| \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{h}(\mathbf{X})\right)\right)^{\top} X\biggr\|^2  \Biggr]$$
where the expectation is taken over the probability density, $p(X,X_1,Y_1,\ldots,X_M,Y_M)$, with $X$ being independent of and identically distributed to $\{X_i\}_{i=1}^M$.
\end{definition}
The proposed definition for RLP loss is based on the observation that $\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{Y}$ and $\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{h}(\mathbf{X})$ represent the regression matrices that solve the linear problem of regressing a subset of observed outcomes and predicted outcomes, respectively, on their associated features. Consequently, RLP loss seeks to minimize the disparity between all conceivable \textit{predicted} hyperplanes and \textit{observed} hyperplanes. In this study, we opt to minimize the distance between these hyperplanes by evaluating the images of points drawn from the support using the random variable, $X$. This approach provides us with points from the hyperplanes, allowing us to minimize the squared distance between them.
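
To make this observation concrete, the loss in Definition~\ref{def:rlp} can be rewritten in terms of two least-squares coefficient vectors. The symbols $\hat{\beta}_{Y}$ and $\hat{\beta}_{h}$ are shorthand introduced here for illustration only, and the identity assumes that $\mathbf{X}^{\top}\mathbf{X}$ is invertible:
$$
\begin{aligned}
\hat{\beta}_{Y} &= \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{Y}, \qquad
\hat{\beta}_{h} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{h}(\mathbf{X}), \\
\mathcal{L}(h) &= \mathbb{E}\left[\left(\left(\hat{\beta}_{Y} - \hat{\beta}_{h}\right)^{\top} X\right)^2\right].
\end{aligned}
$$
As a sanity check, for a linear hypothesis $h(x) = a^{\top}x$ with $a \in \mathbb{R}^d$, we have $\mathbf{h}(\mathbf{X}) = \mathbf{X}a$ and therefore $\hat{\beta}_{h} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\mathbf{X}a = a$, so the predicted hyperplane coincides with the hypothesis itself.
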
We now present the following proposition, which characterizes the hypothesis that minimizes the RLP loss.
%\A{This proposition is false, instead of "with equality iff" it should be with the the hypothesis minimizing the loss being h(x) = ...}
\begin{proposition}
\label{prop:basic}
Let $h\in \mathcal{H}$ be a hypothesis function. Then $\mathcal{L}(h)\geq 0$, and the loss is minimized by the hypothesis $h(x)=\mathbb{E}\left[Y|X=x\right]$ almost surely.
\end{proposition}
This proposition ensures that the optimal hypothesis function, $h$, aligns with the conditional expectation of $Y$ given $X=x$, almost everywhere.
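
The reasoning behind Proposition~\ref{prop:basic} can be sketched as follows, assuming finite second moments and an invertible $\mathbf{X}^{\top}\mathbf{X}$. The symbols $m$ and $\boldsymbol{\varepsilon}$ are shorthand introduced only for this sketch. Write $m(x) = \mathbb{E}\left[Y|X=x\right]$, $\mathbf{A} = \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}$, and $\boldsymbol{\varepsilon} = \mathbf{Y} - \mathbf{m}(\mathbf{X})$, so that $\mathbf{Y} - \mathbf{h}(\mathbf{X}) = \boldsymbol{\varepsilon} + \left(\mathbf{m}(\mathbf{X}) - \mathbf{h}(\mathbf{X})\right)$. Because the pairs $(X_i, Y_i)$ are i.i.d.\ and independent of $X$, we have $\mathbb{E}\left[\boldsymbol{\varepsilon} \,|\, \mathbf{X}, X\right] = 0$, so the cross term vanishes when the square is expanded:
$$
\mathcal{L}(h) = \mathbb{E}\left[\left(X^{\top}\mathbf{A}\,\boldsymbol{\varepsilon}\right)^2\right] + \mathbb{E}\left[\left(X^{\top}\mathbf{A}\left(\mathbf{m}(\mathbf{X}) - \mathbf{h}(\mathbf{X})\right)\right)^2\right].
$$
The first term does not depend on $h$, and the second term is non-negative and equals zero when $h = m$ almost surely, which is the minimizer stated in the proposition.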

Let us now consider a set of parameterized functions denoted by $\mathcal{H} = \{h_\theta : \theta \in \Theta\}$. For simplicity, we write the loss function as $\mathcal{L}(\theta)$ in place of $\mathcal{L}(h_\theta)$.

% \A{maybe we move this result to the appendix as it is a weak result and it will only raise questions from the reviewers.}
% \begin{proposition}
%     \label{thm:rlp_loss_convex}
%     For a class of hypotheses restricted to linear models (i.e., $h_{a,b}(X) = a^\top X + b$ with $a \in \mathbb{R}^d$ and $b\in \mathbb{R}$), there exists a $\delta > 0$ such that for every $\epsilon < \delta$,
%     $$ \mathcal{L}\left(\theta - \epsilon \nabla_{\theta} \mathcal{L}(\theta)\right) \leqslant \mathcal{L}(\theta)$$
% \end{proposition}
% This proposition highlights the convexity property of the loss function for linear models. More concretely, it means that for small enough adjustments in the direction of the gradient, the loss function is non-increasing.

In the following proposition, we assume that the class of hypothesis functions, $\mathcal{H}$, is fully defined by a vector of parameters, $\theta \in \mathbb{R}^W$. In our empirical setting, this corresponds to the class of neural networks with predetermined architectures.
\begin{proposition}
\label{thm:better_loss}
Let $L_0$ denote the MSE loss and let $\theta^{*}$ be the optimal parameters (i.e., $h_{\theta^*} = \mathbb{E}\left[Y|X\right]$ almost surely). We assume that both the MSE and RLP loss functions are convex. 
Under the following conditions:
\begin{enumerate}[label=(\roman*)]
    \item $\mathbb{E}\left[X_{i} X_{j}\right]= [1,\cdots,1]^{\top}\mathbbm{1}_{i=j}$.
    \item $(\mathbf{Y}-h_{\theta}(\mathbf{X})) \leqslant 0 $ and $\nabla_{\theta} h_{\theta}(\mathbf{X}) \leqslant 0 \quad$ \\
    (component-wise inequality).
    \item For every $j,k \in \{1,2,\ldots,d\}$ and for every $l \in \{1,2,\ldots,M\},\; \mathbb{E}\left[\mathbf{a}_{j k} \mathbf{a}_{l k}\right] \geqslant \frac{1}{d^{2}}$, where
    $\left(\mathbf{a}_{jk}\right)$ and $\left(\mathbf{a}_{lk}\right)$ are the components of $\mathbf{A}=\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top}$.
\end{enumerate}
We observe that for every step size $\epsilon \geqslant 0$ and parameter $\theta \in \mathbb{R}^W$ for which gradient descent converges,
$$\; \left\| \theta^{*} - \left(\theta - \epsilon \nabla_{\theta}\mathcal{L}(\theta)\right) \right\| \leqslant \left\|\theta^{*}  - \left(\theta - \epsilon\nabla_{\theta}L_0(\theta)\right) \right\|$$
\end{proposition}
This proposition contrasts the convergence behavior of the two loss functions, MSE and RLP, for gradient descent optimization in parameterized models. It asserts that under conditions (\textit{i}), (\textit{ii}), and (\textit{iii}) of Proposition~\ref{thm:better_loss}, a gradient step on the RLP loss brings the parameters at least as close to the optimal solution as the same step on the MSE loss.


\section{Algorithm}
\label{SecAlgorithm}

In this section, we detail our methodology for training neural networks using the Random Linear Projections (RLP) loss. Our approach comprises two main steps. First, we employ the \textit{balanced batch generation} strategy to sample unique batches from the training dataset. Subsequently, we utilize these batches to train a neural network model using gradient descent and our proposed RLP loss. 

Let $J = \{(x_i,y_i)\}_{i=1}^N$ denote the observed training dataset, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. Let $M\ll N$ be the number of training examples used to identify the regression matrices of the different hyperplanes; we refer to $M$ as the batch size. Let $P = \binom{N}{M+1} = \frac{N!}{(M+1)! (N-M-1)!}$ denote the number of combinations of size $M+1$ that can be drawn from the training data. The RLP loss is computed over all such combinations. For each combination, the regression matrices are constructed from the first $M$ components, and the dot product is then taken between the difference of these regression matrices and the $(M+1)^{\text{th}}$ component. Hence, the proposed empirical RLP loss function can be defined as follows:
$$
L(\theta) = \frac{1}{P}\sum_{j=1}^{P} \left(\left(\left(\textbf{x}_j^{\top}\textbf{x}_j\right)^{-1} \textbf{x}_j^{\top} \left(\textbf{y}_j - \textbf{h}_{\theta}(\textbf{x}_j)\right)\right)^{\top} x_j\right)^2
$$
Above, $\textbf{x}_j = \left(x_{j_1},\ldots,x_{j_M}\right)^{\top}$ is the matrix in $\mathcal{M}_{M,d}(\mathbb{R})$ whose rows correspond to $M$ different $x_{j_k}$ from the set of training data feature vectors, $\mathbf{y}_j =\left(y_{j_1},\ldots,y_{j_M}\right)^{\top}$ denotes the corresponding labels, and $x_j$ denotes an observed feature vector distinct from all rows of the matrix $\textbf{x}_{j}$. By the law of large numbers, the empirical RLP loss converges in probability to the RLP loss (Definition~\ref{def:rlp}). Given that the number of combinations, $P$, can be exceedingly large, our approach for training the regression neural network with the RLP loss involves randomly sampling $K$ batches rather than enumerating all $P$ combinations of the training dataset, $J$.
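
To give a sense of scale, consider the illustrative values $N = 100$ and $M = 10$, which are chosen here only as an example and are not taken from the experiments:
$$
P = \binom{N}{M+1} = \binom{100}{11} = \frac{100!}{11!\,89!} \approx 1.4 \times 10^{14}.
$$
Evaluating every term of the empirical loss is therefore infeasible even for a small dataset, whereas sampling $K$ batches with $K$ on the order of $10^2$ to $10^3$ limits each epoch to $K$ least-squares fits on matrices of size $M \times d$.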

\subsection{Balanced Batch Generation}
The objective of \textit{balanced batch generation} is to produce batches from the training dataset such that each example appears in at least one batch and no two batches are identical. Let $J$ denote the training dataset (features with their corresponding labels), $M$ the size of each batch, and $K$ the total number of batches we intend to generate. To construct the set of balanced batches, $\mathcal{B}$, from $J$, we repeatedly shuffle the index set, $\mathcal{I}$, and partition it into consecutive batches of size $M$, which is intended to ensure that each data point is incorporated in at least one batch. To maintain the uniqueness of batches and avoid repetitions, each candidate batch is added to $\mathcal{B}$ only if it is not already present.
\begin{algorithm}[h!]
\caption{Balanced Batch Generator}
\SetKwInput{KwInput}{Input}         % Set the Input
\SetKwInput{KwOutput}{Output} 
\SetKw{Break}{break}
\label{alg:bbg}
\DontPrintSemicolon
\KwInput{$J$ (Training dataset), $M$ (Batch size), \linebreak $K$ (Number of batches to generate)}
\KwOutput{$\mathcal{B}$ (Set of generated batches)}
$\mathcal{I} \gets \{0, 1, \dots, |J|-1\}$ (Initialize set of all indices) \\
$\mathcal{B} \gets \emptyset$ (Initialize set of generated batches) \\

\While{$|\mathcal{B}| < K$}{
    $\text{Randomly shuffle } \mathcal{I} \text{ to obtain } \mathcal{I}_{\text{shuffled}}$ \\
    \For{$i = 0,M,2M,\ldots,|J|-M$}{
        $b \gets \{J[\mathcal{I}_{\text{shuffled}}[i:i+M]]\}$ \\
        \If{$b \notin \mathcal{B}$}{
            $\mathcal{B} \leftarrow \mathcal{B} \cup \{b\}$
        }
        \If{$|\mathcal{B}| \geqslant K$}{
            \Break
        }
    }
}
\Return{$\mathcal{B}$}
\end{algorithm}

The main loop samples unique batches until we accumulate a total of $K$ batches. Within this loop, we first shuffle the full set of dataset indices, $\mathcal{I}$, to ensure randomness. We then allocate training examples to batches in strides of size $M$. As each batch is formed, we check whether it already exists in $\mathcal{B}$ to uphold the uniqueness principle. This operation continues until we have attained our target number of unique batches, $K$.
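
As a rough guide to the cost of this procedure, the number of passes through the outer loop can be counted directly. The count below is a sketch that assumes $M$ divides $|J|$ and that duplicate batches are rare, and the numerical values are illustrative rather than taken from the experiments. Each pass partitions the shuffled indices into $|J|/M$ disjoint batches, so generating $K$ batches requires at least
$$
\left\lceil \frac{K}{|J|/M} \right\rceil = \left\lceil \frac{KM}{|J|} \right\rceil
$$
passes. For example, with $|J| = 100$, $M = 10$, and $K = 100$, each pass yields $10$ batches and at least $\lceil 100 \cdot 10 / 100 \rceil = 10$ passes are needed, so every training example is drawn about $10$ times across $\mathcal{B}$.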

Per Algorithm~\ref{alg:bbg}, no two batches in $\mathcal{B}$ are identical, and each training example in $J$ appears in at least one batch whenever $M$ divides $|J|$ and $K \geqslant |J|/M$, since the first pass then partitions the entire dataset. During each training epoch, we iterate over the $K$ randomly sampled batches and evaluate the Random Linear Projections loss. The procedure for training a neural network using gradient descent with the RLP loss is provided in Algorithm~\ref{alg:nnrlp}.
\begin{algorithm}[h!]
\caption{Neural Network Training With RLP Loss}
\SetKwInput{KwInput}{Input}
\SetKwInput{KwOutput}{Output}
\label{alg:nnrlp}
\DontPrintSemicolon

\KwInput{$J$ (Training dataset), $\theta$ (Initial NN parameters), $\alpha$ (Learning rate), $M$ (Batch size), $K$ (Number of batches to generate), $E$ (Number of epochs)}
\KwOutput{$\theta$ (Trained NN parameters)}

% Initialization
$\mathcal{B} \gets$ \texttt{Balanced\_Batch\_Generator}($J$, $M$, $K$) 
%\tcp*{Generate batches}

\For{$\text{epoch} = 1,2,\ldots,E$}{
    \For{$j = 1,2,\ldots,K$}{
        $\textbf{x}_j \gets$ Matrix of features from batch $\mathcal{B}[j]$ \\ 
        $\textbf{y}_j \gets$ Vector of labels from batch $\mathcal{B}[j]$ \\
        % compute the observed regression matrix
        $M_y \gets \left(\textbf{x}_j^{\top}\textbf{x}_j\right)^{-1} \textbf{x}_j^{\top} \textbf{y}_j$\\
        $M_h \gets \left(\textbf{x}_j^{\top}\textbf{x}_j\right)^{-1} \textbf{x}_j^{\top} \textbf{h}_{\theta}(\textbf{x}_j)$\\
        Randomly sample $x_j$ (feature vector) from $J$\\
        % Compute RLP Loss
        $l_j(\theta) \gets  \left(\left(M_y - M_h\right)^{\top} x_j\right)^2$ \\ %\tcp*{Compute RLP loss}
        }
        % Gradient Descent Update
    $\mathcal{L}(\theta) \gets \frac{1}{K} \sum_{j=1}^{K} l_j(\theta)$ \\
    $\theta \gets \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta)$ %\tcp*{Update parameters}    
}
\Return{$\theta$} %\tcp*{Return trained parameters}
\end{algorithm}

Algorithm~\ref{alg:nnrlp} provides a systematic procedure for training a neural network with RLP loss. In each epoch, and for each batch within this epoch, we compute the observed and predicted regression matrices and evaluate the per-batch RLP loss; once all $K$ batches have been processed, we update the model parameters with a gradient descent step on the averaged loss. This iterative process continues for a predefined number of epochs, driving the model toward a solution that minimizes the RLP loss.
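
For reference, the gradient used in the update can be written out explicitly. The following is a sketch under the assumption that $h_\theta$ is differentiable in $\theta \in \mathbb{R}^W$; the weight vector $w_j$ and the Jacobian $\mathbf{G}_j$ are notation introduced here for illustration only. Let $w_j = \textbf{x}_j\left(\textbf{x}_j^{\top}\textbf{x}_j\right)^{-1} x_j \in \mathbb{R}^M$ and $\mathbf{G}_j = \nabla_{\theta}\textbf{h}_{\theta}(\textbf{x}_j) \in \mathcal{M}_{M,W}(\mathbb{R})$. The per-batch loss then reads $l_j(\theta) = \left(w_j^{\top}\left(\textbf{y}_j - \textbf{h}_{\theta}(\textbf{x}_j)\right)\right)^2$, and the chain rule gives
$$
\nabla_{\theta}\, l_j(\theta) = -2\left(w_j^{\top}\left(\textbf{y}_j - \textbf{h}_{\theta}(\textbf{x}_j)\right)\right)\mathbf{G}_j^{\top} w_j.
$$
By comparison, the MSE gradient on the same batch is
$$
\nabla_{\theta}\, \frac{1}{M}\left\|\textbf{y}_j - \textbf{h}_{\theta}(\textbf{x}_j)\right\|^2 = -\frac{2}{M}\,\mathbf{G}_j^{\top}\left(\textbf{y}_j - \textbf{h}_{\theta}(\textbf{x}_j)\right).
$$
In the RLP case, the per-example gradients are combined with weights $w_j$ that depend only on the features and are scaled by a single projected residual, whereas MSE weights each per-example gradient by its own residual.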


\section{Empirical Results}
\label{SecEmpiricalResults}

In this section, we present our empirical results for regression, image reconstruction, and classification tasks, using a variety of synthetic and benchmark datasets. We first present the regression results on two benchmark datasets (California Housing~\citep{geron2022hands} and Wine Quality~\citep{cortez2009modeling}), as well as two synthetic datasets: one \textit{Linear} dataset where the true function is a linear combination of the features in the dataset, and one \textit{Nonlinear} dataset, where the true function combines polynomial terms with trigonometric functions of the features in the dataset. For the image reconstruction tasks, we utilize two different datasets: MNIST~\citep{deng2012mnist} and CIFAR-10~\citep{krizhevsky2009learning}. We also present the classification results on MNIST (for classification results on the Moons dataset, see Section~\ref{secClassification} of the Appendix). A comprehensive description of these datasets is provided in Section~\ref{secDatasets} of the Appendix.

For the evaluations that follow, our default setup utilizes $|J| = 0.5|\mathcal{X}|$ training and $|G| = |\mathcal{X}| - |J|$ test examples, where $|\mathcal{X}|$ signifies the size of each dataset, and does not include any distribution shift or additive noise. To ensure a fair comparison, we also use the same learning rate for each loss and network architecture across all experiments on a given dataset. Deviations from this configuration are explicitly mentioned in the subsequent analysis. We first present the performance results when the neural network is trained with RLP loss, MSE loss, and MSE loss with $L_2$ regularization for regression and reconstruction tasks, and with RLP loss and Cross Entropy loss for classification tasks. For the RLP loss case, the neural network is trained using $K = 1000$ batches (see Algorithm~\ref{alg:bbg}). Moreover, we present ablation studies on the impact of three different factors:
\begin{enumerate}
\item[\textbf{(1)}] The number of training examples $|J| \in \{50,100\}$.
\item[\textbf{(2)}] The distribution shift bias $\gamma \in \{0.1,0.2,\ldots,0.9\}$.
\item[\textbf{(3)}] The noise scaling factor $\beta \in \{0.1,0.2,\ldots,0.9\}$ for the additive standard normal noise.
\end{enumerate}
In \textbf{(1)}, the neural network is trained using $K = 100$ batches for the RLP case, and in \textbf{(2)} and \textbf{(3)}, the neural network is trained using $K = 1000$ batches for the RLP case, produced via Algorithm \ref{alg:bbg}. Our empirical findings demonstrate that the proposed loss helps mitigate the vulnerability of neural networks to these issues. 

\subsection{Performance Analysis}

This first evaluation assesses our method across various benchmark and synthetic datasets, comparing RLP loss against MSE loss, its variant with $L_2$ regularization, and Cross Entropy loss in the default setup, where none of the above ablations is applied.
\begin{table*}[h!]
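\caption{Test performance across different datasets for $|J| = 0.5|\mathcal{X}|$ training examples and $|\mathcal{X}| - |J|$ test examples. Columns are grouped by the loss used for training; within each group, test performance is reported under both the MSE and RLP metrics.}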
\label{performance-table1}
\begin{center}
\begin{tabular}{l|cc|cc|cc}
\hline 
& \multicolumn{2}{c|}{\textbf{\texttt{MSE}}} & \multicolumn{2}{c|}{\textbf{\texttt{MSE with}} $\boldsymbol{L_2}$} & \multicolumn{2}{c}{\textbf{\texttt{RLP}}} \\
\textbf{Dataset} & \textbf{Perf. (MSE)} & \textbf{Perf. (RLP)} & \textbf{Perf. (MSE)} & \textbf{Perf. (RLP)} & \textbf{Perf. (MSE)} & \textbf{Perf. (RLP)} \\
\hline
California Housing & $0.915 \scriptstyle \pm 0.997$ & $0.101 \scriptstyle \pm 0.127$ & $0.961 \scriptstyle \pm 1.151$ & $0.106 \scriptstyle \pm 0.102$ & $0.575 \scriptstyle \pm 0.314$ & $0.016 \scriptstyle \pm 0.012$ \\
Wine Quality & $0.542 \scriptstyle \pm 0.014$ & $0.194 \scriptstyle \pm 0.386$ & $0.546 \scriptstyle \pm 0.015$ & $0.169 \scriptstyle \pm 0.089$ & $0.532 \scriptstyle \pm 0.011$ & $0.031 \scriptstyle \pm 0.015$ \\
Linear & $0.227 \scriptstyle \pm 0.104$ & $0.111 \scriptstyle \pm 0.059$ & $0.209 \scriptstyle \pm 0.086$ & $0.087 \scriptstyle \pm 0.065$ & $2.6\text{e-}6 \scriptstyle \pm 1.7\text{e-}6$ & $5.2\text{e-}8 \scriptstyle \pm 9.2\text{e-}8$ \\
Nonlinear & $0.075 \scriptstyle \pm 0.009$ & $0.008 \scriptstyle \pm 0.004$ & $0.073 \scriptstyle \pm 0.006$ & $0.008 \scriptstyle \pm 0.005$ & $0.033 \scriptstyle \pm 0.012$ & $0.002 \scriptstyle \pm 0.001$ \\
MNIST & $0.042 \scriptstyle \pm 0.001$ & $0.047 \scriptstyle \pm 0.001$ & $0.052 \scriptstyle \pm 0.011$ & $0.049 \scriptstyle \pm 0.001$ & $0.018 \scriptstyle \pm 0.002$ & $4.7\text{e-}3\scriptstyle \pm 5.0\text{e-}4$ \\
CIFAR-10 & $6.0\text{e-}4 \scriptstyle \pm 1.0\text{e-}4$ & $6.1 \text{e-}4\scriptstyle \pm 9.8\text{e-}6$ & $1.8\text{e-}3 \scriptstyle \pm 1.0\text{e-}4$ & $1.8\text{e-}3 \scriptstyle \pm 7.1\text{e-}6$ & $2.7\text{e-}5\scriptstyle \pm 1.0\text{e-}6$ & $2.7\text{e-}5 \scriptstyle \pm 1.0\text{e-}6$ \\
\hline
\end{tabular}
\end{center}
\end{table*}


% Training/Test on Half of Dataset
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Epochs.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Epochs.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Linear_vs_Epochs.png}
        \caption{Linear}
    \end{subfigure}
\\
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Epochs.png}
        \caption{Nonlinear}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/figure_mnist_performance.png}
        \caption{MNIST}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/results_cifar10.png}
        \caption{CIFAR-10}
    \end{subfigure}
    \caption{Test performance comparison across six datasets (California Housing, Wine Quality, Linear, Nonlinear, MNIST, and CIFAR-10) using three different loss functions: Mean Squared Error (MSE), MSE with $L_2$ regularization (MSE + $L_2$), and RLP. The x-axis represents training epochs, while the y-axis indicates the test MSE.}
    
\end{figure*}

% Distribution shift
\begin{figure*}[t]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Shift.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Shift.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Shift.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Distribution shift test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: Mean Squared Error (MSE), MSE with $L_2$ regularization (MSE + $L_2$), and RLP. The x-axis is the degree of bias, $\gamma$, between the test data and the train data, while the y-axis indicates the test MSE.}
\end{figure*}

\paragraph{Regression Task Results.} 
For the California Housing dataset, a benchmarking dataset for regression tasks, we observe several differences in performance. In particular, when we leverage the Adam optimizer \citep{kingma2014adam}, the regression neural network trained with RLP loss demonstrates enhanced efficacy. This contrasts with the case where the regression neural network is trained with MSE loss and its $L_2$ regularized counterpart. Notably, RLP loss not only exhibits superior performance, since the test error is lower when measured using MSE or RLP, but also demonstrates resilience against overfitting. We observe that after $100$ training epochs, MSE loss and its $L_2$ regularized counterpart begin to overfit the training data, resulting in diminished generalization, whereas RLP loss continues to minimize the test error. We further observe that the standard deviation of the test error (compiled after $500$ training epochs) is demonstrably lower when the regression neural network is trained with RLP loss as opposed to MSE loss or MSE loss with $L_2$ regularization.

Subsequently, for the Wine Quality dataset, which has features derived from physicochemical tests assessing wine constituents and their influence on quality, we discern several performance differences. When using the Adam optimizer, the regression neural network trained with RLP loss outperforms those trained with MSE loss and $L_2$ regularized MSE loss. RLP loss not only showcases improved performance metrics --- evidenced by a reduced test error when assessed by either MSE or RLP --- but also demonstrates more rapid convergence. In particular, within just $20$ training epochs, we observe a test MSE of $0.6$ in the RLP loss case. In contrast, both the MSE loss and MSE loss + $L_2$ regularization cases only achieve this test MSE after $200$ epochs. Moreover, we find that the standard deviation of the test error, gathered after $200$ training epochs, is lower when the regression neural network is trained using RLP loss as opposed to MSE loss or its $L_2$ regularized variant.

Delving into the synthetic datasets, we consider the aforementioned two scenarios: a Linear dataset, where the true function is a linear combination of its features, and a Nonlinear dataset, where the true function blends polynomial and trigonometric functions of its features. Using the Adam optimizer, the regression neural network trained with RLP loss exhibits a notably improved performance trajectory in both synthetic scenarios, outpacing networks trained with MSE loss and $L_2$ regularized MSE loss. In the context of the Linear dataset, the efficacy of RLP loss is particularly pronounced, as it converges to a test MSE below $3 \times 10^{-6}$ within $200$ training epochs. In contrast, both MSE loss and its $L_2$ regularized version yield a test MSE above $0.2$ at the same epoch count. As it pertains to the Nonlinear dataset, RLP loss similarly yields a lower test error in comparison to MSE loss and its $L_2$ regularized counterpart. Cumulatively, these results demonstrate that RLP loss yields improved performance over MSE loss and MSE loss with $L_2$ regularization, even when the true function has nonlinearities.

\paragraph{Image Reconstruction Task Results.}

With MNIST, a benchmark dataset in image reconstruction tasks, our findings are in line with previous observations. Leveraging the SGD optimizer with Nesterov momentum \citep{nesterov1983method}, we observe that the autoencoder trained using RLP loss yields a test MSE of $0.018$ after $100$ training epochs, whereas the autoencoder trained using MSE loss or $L_2$ regularized MSE loss yields a test MSE above $0.04$ after $100$ training epochs. When the test error is instead measured using RLP, the gains provided by RLP loss over MSE and MSE + $L_2$ regularization become even more apparent. As before, we also observe that the standard deviation of the test error, gathered after $100$ training epochs, is lower when the autoencoder is trained using RLP loss instead of MSE loss or its $L_2$ regularized variant.

Our experiments on CIFAR-10 corroborate our earlier findings from the MNIST experiments. Utilizing the SGD optimizer with Nesterov momentum, we observe a test MSE of $2.7 \times 10^{-5}$ when the autoencoder is trained with RLP loss for $50$ epochs. In contrast, we observe a test MSE exceeding $5.0 \times 10^{-4}$ when the autoencoder is trained with MSE loss or $L_2$ regularized MSE loss for $50$ epochs. We also observe a reduction in the standard deviation of the test error after $50$ epochs when the autoencoder is trained using RLP loss versus MSE loss and MSE loss with $L_2$ regularization.

\paragraph{Classification Task Results.}
Per Section \ref{secClassification} of the Appendix, RLP loss can also be applied to classification tasks. We consider the MNIST dataset for our experiments. Using the AdamW optimizer \citep{loshchilov2017decoupled}, we observe that the convolutional neural network (CNN) trained with RLP loss reaches a test accuracy of $96\%$ after $10$ epochs. In contrast, the CNN trained with Cross Entropy loss reaches a test accuracy of $86\%$ after $10$ epochs. This evaluation demonstrates that the faster convergence yielded by RLP loss is preserved in classification scenarios.
\begin{figure}[t!]
    \centering
    \includegraphics[width=0.48\textwidth]{figures/MNIST_Training_Size_5000.png}
    \caption{Test performance comparison on MNIST using Cross Entropy loss and RLP loss. The x-axis represents training epochs, while the y-axis indicates the classification accuracy (left) and F1 score (right).}
    \label{mnist}
\end{figure}

\begin{figure}[h!]
    \centering
    % First Row: MSE Images
    \begin{subfigure}{0.13\textwidth}
        \includegraphics[width=\linewidth]{figures/mnist_mse_50_5.png}
        \caption{Epoch 5}
    \end{subfigure}
    \begin{subfigure}{0.13\textwidth}
        \includegraphics[width=\linewidth]{figures/mnist_mse_50_10.png}
        \caption{Epoch 10}
    \end{subfigure}
    \begin{subfigure}{0.13\textwidth}
        \includegraphics[width=\linewidth]{figures/mnist_mse_50_50.png}
        \caption{Epoch 50}
    \end{subfigure}
    % Second Row: RLP Images
    \begin{subfigure}{0.13\textwidth}
        \includegraphics[width=\linewidth]{figures/mnist_rlp_50_5.png}
        \caption{Epoch 5}
    \end{subfigure}
    \begin{subfigure}{0.13\textwidth}
        \includegraphics[width=\linewidth]{figures/mnist_rlp_50_10.png}
        \caption{Epoch 10}
    \end{subfigure}
    \begin{subfigure}{0.13\textwidth}
        \includegraphics[width=\linewidth]{figures/mnist_rlp_50_50.png}
        \caption{Epoch 50}
    \end{subfigure}
    
    \caption{Comparison of reconstructed images for an autoencoder trained with MSE loss (top row) and RLP loss (bottom row) at different epochs. The model trained with RLP loss learns faster and better with limited data ($|J| = 50$).}
    \label{fig:comparison}
\end{figure}


\subsection{Ablation Studies}

This next evaluation delves into the performance dynamics of our methods under ablated data scenarios, highlighting the resilience of RLP loss relative to MSE loss and MSE loss with $L_2$ regularization in the presence of data perturbations.

\paragraph{Number of Training Examples.}

\begin{table}[t!]
\caption{Test Performance for $|J| = 50$ training examples.}
\label{performance-table-50}
\begin{center}
\begin{tabular}{l|c|c|c}
\hline 
\textbf{Dataset} & \textbf{\texttt{MSE}} & \textbf{\texttt{MSE}}+$\boldsymbol{L_2}$ & \textbf{\texttt{RLP}} \\
\hline
Cali. Housing & $4.09 \scriptstyle \pm 3.00$ & $4.42 \scriptstyle \pm 3.44$ & $3.04 \scriptstyle \pm 1.87$ \\
Wine Quality & $1.16 \scriptstyle \pm 0.26$ & $1.31 \scriptstyle \pm 0.47$ & $1.15 \scriptstyle \pm 0.14$ \\
Linear & $0.86 \scriptstyle \pm 0.19$ & $0.84 \scriptstyle \pm 0.20$ & $5.0\text{e-}4 \scriptstyle \pm 7.0\text{e-}4$ \\
Nonlinear & $0.13 \scriptstyle \pm 0.02$ & $0.13 \scriptstyle \pm 0.03$ & $0.09 \scriptstyle \pm 0.03$ \\
MNIST & $0.23 \scriptstyle \pm 0.01$ & $0.23 \scriptstyle \pm 0.01$ & $0.05 \scriptstyle \pm 0.01$ \\
CIFAR-10 & $0.25 \scriptstyle \pm 0.01$ & $0.25 \scriptstyle \pm 0.01$ & $5.6\text{e-}4 \scriptstyle \pm 1.2\text{e-}5$ \\
\hline
\end{tabular}
\end{center}
\end{table}

% 50 Training Data Points
\begin{figure}[t!]
    \centering
    \begin{subfigure}{0.23\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Training_Size_50.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.23\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Training_Size_50.png}
        \caption{Wine Quality}
    \end{subfigure}
    \caption{Limited training data ($|J| = 50$) test performance comparison across two datasets (California Housing and Wine Quality) using three different loss functions: Mean Squared Error (MSE), MSE with $L_2$ regularization (MSE + $L_2$), and RLP. The x-axis represents training epochs, while the y-axis indicates the test MSE.}
    \label{fig:50_training_points}
\end{figure}

\begin{figure*}[t!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Noise.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Noise.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Noise.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Noise robustness test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: Mean Squared Error (MSE), MSE with $L_2$ regularization (MSE + $L_2$), and RLP. The x-axis is the scaling factor, $\beta$, for the additive standard normal noise, while the y-axis indicates the test MSE.}
\end{figure*}

In the ablation study with a constraint of $|J| = 50$ training examples, our findings across the six datasets underline the robustness and efficacy of the RLP loss. For the California Housing and Wine Quality regression benchmark datasets, the RLP loss-trained models consistently outperform their MSE loss and $L_2$ regularized MSE loss-trained counterparts in both convergence rate and test error, despite the limited data. 
% In particular, we observe that neural networks trained with RLP loss observe accelerated \textit{grokking} \citep{power2022grokking}, where the model begins to generalize past the point of overfitting (see Figure \ref{fig:50_training_points}). 
This trend persists in the Linear and Nonlinear synthetic regression datasets, with RLP loss-trained neural networks achieving rapid convergence and low error. Similarly, for the image reconstruction tasks on MNIST and CIFAR-10, RLP loss-trained models achieve faster convergence. Figure \ref{fig:comparison} in particular shows that after $5$, $10$, and $50$ training epochs, the MNIST images reconstructed by the RLP loss-trained autoencoder are more accurate and clearer than those generated by the MSE loss-trained autoencoder. Results for the case of $|J| = 100$ training data points can be found in Section \ref{secRegReconstruct} of the Appendix.

\paragraph{Distribution Shift Bias.}

In this ablation study, we consider the case of a distribution shift between the train and test data, characterized by a bias parameter, $\gamma$. Given a dataset $\mathcal{X}$ consisting of $d$-dimensional feature vectors, $x_i$, let $\boldsymbol{\mu}$ be the mean vector of $\mathcal{X}$ and $\boldsymbol{\sigma}$ be the standard deviation vector. We first introduce notation for the element-wise comparison of vectors: for two vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$, we write $\mathbf{a} \prec \mathbf{b}$ to denote that $a_j < b_j$ for all $j \in \{1,2,\ldots,d\}$. Using this notation, we define the region of interest (ROI) in the feature space via two conditions that must hold simultaneously: $x_i - \boldsymbol{\mu} \prec \boldsymbol{\epsilon}$ and $\boldsymbol{\mu} - x_i \prec \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} = 0.5 \times \boldsymbol{\sigma}$. Per these definitions:

\textbf{(1)} For examples within the ROI (close to the mean):
\begin{equation*}
\mathbb{P}[x_i \in J \ | \ (x_i - \boldsymbol{\mu} \prec \boldsymbol{\epsilon}) \ \ \text{and} \ \ (\boldsymbol{\mu} - x_i \prec \boldsymbol{\epsilon})] = \gamma
\end{equation*}
\textbf{(2)} For examples outside the ROI (far from the mean):
\begin{equation*}
\mathbb{P}[x_i \in J \ | \ (x_i - \boldsymbol{\mu} \nprec \boldsymbol{\epsilon}) \ \ \text{or} \ \ (\boldsymbol{\mu} - x_i \nprec \boldsymbol{\epsilon})] = 1 - \gamma
\end{equation*}
Thus, if $\gamma > 0.5$, examples within the ROI (closer to the mean) are more likely to be included in the training dataset, and examples outside the ROI are more likely to be included in the test dataset; if $\gamma < 0.5$, the reverse holds. By varying the bias parameter, $\gamma$, which modulates the distribution shift between the training and test data, we discern a consistent trend favoring the RLP loss across the California Housing, Wine Quality, and Nonlinear datasets. The Nonlinear dataset in particular illustrates that regardless of the selected distribution shift bias, neural networks trained with RLP loss consistently outperform, in terms of test MSE, those trained with MSE loss or its $L_2$ regularized counterpart. These findings emphasize the robustness of RLP loss in the face of distributional disparities between training and test data.
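
As a brief illustration of how $\gamma$ controls the split, consider the expected size of the training set. This is only a sketch: the symbol $\rho$, denoting the fraction of examples of $\mathcal{X}$ that fall within the ROI, is introduced here purely for illustration and is not used elsewhere. By linearity of expectation, the two assignment probabilities above give
\begin{equation*}
\frac{\mathbb{E}\left[|J|\right]}{|\mathcal{X}|} = \gamma \rho + (1 - \gamma)(1 - \rho)
\end{equation*}
For $\gamma = 0.5$, this fraction equals $0.5$ regardless of $\rho$, and examples inside and outside the ROI are treated identically, so no shift is induced. For $\gamma = 0.9$, an example within the ROI is placed in the training set with probability $0.9$, whereas an example outside the ROI is placed there with probability $0.1$.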

\paragraph{Noise Scaling Factor.}
Given the training dataset, $J$, the objective of this ablation study is to examine the impact of additive Gaussian noise on the performance of RLP loss-trained, MSE loss-trained, and $L_2$ regularized MSE loss-trained models. Specifically, we add standard normal noise scaled by a factor, $\beta$, to each example $x_i \in J$, where $i \in \{1,2,\ldots,N\}$. The modified training dataset, $J'$, is denoted as $J' = \{(x_i',y_i)\}_{i=1}^N$, where:
\begin{equation*}
x_i' = x_i + \beta \times \mathcal{N}(\mathbf{0}, \mathbf{I}_d)
\end{equation*}
This experimental setup allows us to gauge how the signal-to-noise ratio (SNR) influences the efficacy of our regression neural network when it is trained using RLP loss, conventional MSE loss, or MSE loss with $L_2$ regularization.
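
To make the noise level explicit, the perturbation can be written out as follows. This is a short worked note: the symbol $\boldsymbol{\xi}_i$ for the noise draw is introduced here for illustration only, and we assume it is sampled independently of $x_i$. Writing $x_i' = x_i + \beta \boldsymbol{\xi}_i$ with $\boldsymbol{\xi}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$, we have, conditioned on $x_i$,
\begin{equation*}
x_i' \sim \mathcal{N}(x_i, \beta^2 \mathbf{I}_d), \qquad \mathbb{E}\left[\|x_i' - x_i\|^2\right] = \beta^2 d
\end{equation*}
Each feature therefore receives noise of variance $\beta^2$, and the expected squared perturbation grows quadratically in $\beta$.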

We now evaluate the robustness of the RLP loss under different noise intensities by varying the noise scaling factor, $\beta$. Across the California Housing, Wine Quality, and Nonlinear datasets, for all tested values of $\beta$, the neural network trained using RLP loss consistently achieves a lower test MSE compared to those trained with MSE loss and MSE loss with $L_2$ regularization. Furthermore, as $\beta$ increases, implying more noise in the training data, we observe that the RLP loss-trained neural network displays more pronounced asymptotic behavior in the test MSE relative to its counterparts trained with MSE loss and MSE loss with $L_2$ regularization. This behavior indicates that RLP loss not only mitigates the detrimental effects of additive noise but also adapts more effectively to its presence, highlighting its robustness under such data perturbations.

%\SH{All Figures will be updated (early results shown for benchmark datasets - will be much more comprehensive)}
% \subsection{Benchmark Datasets}
% We first evaluated the efficacy of the RLP loss on several benchmark datasets that are ubiquitously employed in the machine learning community. These datasets, with their diverse characteristics and challenges, serve as a rigorous testbed for any new methodology.

% \subsubsection{California Housing Dataset}
% In our first evaluation, we outline the efficacy of using our random linear projections loss to train a neural network on the California Housing Benchmark Dataset. For the random linear projections loss case, our neural network is trained using $K \in \{10^2, 10^3, 10^4\}$ `batches' that have been randomly sampled from all possible batches. The size of each batch is $n = 100$. We compare our RLP loss with MSE and MSE with $L_2$ regularization.

% \begin{figure}[h!]
%     \centering
%     \captionsetup{justification=centering}
%     \includegraphics[width=1\linewidth]{California_Housing_Test_Loss_Number_of_Batches.png}
%     \caption{Random Linear Projections Loss vs. Mean Squared Error on California Housing Dataset}
%     \label{California_Housing}
% \end{figure}

% As shown by Figure \ref{MNIST_Autoencoder}, we observe that training our regression neural network using the random linear projections loss achieves better performance on the test examples over the more traditional MSE-training approaches. This empirical evaluation supports the theoretical performance guarantees of the random linear projections loss.

% \subsubsection{MNIST Autoencoder Case}
% In our second evaluation, we outline the efficacy of using our random linear projections loss when training an autoencoder to reconstruct MNIST digits. For the random linear projections loss case, our neural network is trained using $10^3$ `batches' that have been randomly sampled from all possible batches. The size of each batch is $n = 10^4$. We compare this random linear projections loss with mean squared error (MSE), and MSE with $L_2$ regularization on the test dataset.
% \begin{figure}[h!]
%     \centering
%     \captionsetup{justification=centering}
%     \includegraphics[width=0.8\linewidth]{MNIST_Autoencoder_All_Lines.png}
%     \caption{Random Linear Projections Loss vs. Mean Squared Error for MNIST Digit Reconstruction}
%     \label{MNIST_Autoencoder}
% \end{figure}

% As shown by Figure \ref{MNIST_Autoencoder}, we observe that training our regression neural network using the random linear projections loss achieves better performance on the test examples over the more traditional MSE-training approaches. This empirical evaluation supports the theoretical performance guarantees of the random linear projections loss.

% \subsection{Synthetic Datasets}
% To further validate our claims and understand the behavior of RLP loss in controlled environments, we generated synthetic datasets where the ground truth patterns and noise levels could be modulated.

\section{Conclusion}
\label{SecConclusion}
In this work, we presented a new loss function called RLP loss, tailored for capturing non-local linear properties in observed datasets. We provided a mathematical analysis outlining relevant properties of RLP loss, and extended this analysis via rigorous empirical testing on benchmark and synthetic datasets for regression, reconstruction, and classification tasks. 
%We observed that neural networks trained with the proposed loss are more robust to distribution shifts between the training and test data, more resilient to additive noise, and generalize better in the presence of limited training data. 
It is important to note that training neural networks with RLP loss involves inverting matrices during each training epoch, which is computationally expensive. Optimizing this training process and proving further statistical properties of RLP loss are open research problems. We consider this work to be a milestone in designing loss functions that capture non-local geometric properties verified by observed datasets.

\newpage
\acknowledgements{Shyam Venkatasubramanian and Vahid Tarokh were supported in part by the Air Force Office of Scientific Research (AFOSR) under award FA9550-21-1-0235. Ahmed Aloui was supported in part by the National Science Foundation (NSF) under the National AI Institute for Edge Computing Leveraging Next Generation Wireless Networks Grant \# 2112562. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Department of Defense and the U.S. National Science Foundation.}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\linewidth]{figures/nsf_logo.png}
\end{figure}

\bibliography{uai2024-template}    

\clearpage
\appendix
\onecolumn
{\Huge \textbf{Appendix}}

\section{PROOFS OF THE THEORETICAL RESULTS} \label{secTheoretical}
In this section, we present the proofs of the theoretical results outlined in the main text. 
\begin{proposition1}
% \label{prop:basic}
Let $h\in \mathcal{H}$ be a hypothesis function. We observe that $\mathcal{L}(h)\geq 0$ with the hypothesis minimizing the loss being $h(x)=\mathbb{E}\left[Y|X=x\right]$ almost surely.
\end{proposition1}
\begin{proof}
Let $\mathbf{X} \in \mathcal{M}_{M,d}(\mathbb{R})$. First, we observe that $\mathcal{L}(h) = \mathbb{E}\Biggl[\biggl\| \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{h}(\mathbf{X})\right)\right)^{\top} X\biggr\|^2  \Biggr]$ is the expectation of a non-negative random variable. Accordingly, the expectation is non-negative, and therefore, $\mathcal{L}(h)\geq 0$. 
    
We start by proving the first implication. Suppose that $h(x)=\mathbb{E}\left[Y|X=x\right]$ almost surely. Then, the extension $\mathbf{h}$ of $h$ satisfies $\mathbf{h}\left(X_1,\ldots,X_M\right) =  \begin{pmatrix} \mathbb{E}\left[Y|X_1\right],\ldots,\mathbb{E}\left[Y|X_M\right]
\end{pmatrix}^{\top}$. Therefore, 
$$
\begin{aligned}
\mathcal{L}(h) & = \mathbb{E}\Biggl[\biggl\| \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{h}(\mathbf{X})\right)\right)^{\top} X\biggr\|^2  \Biggr]\\
 & = \mathbb{E}\Biggl[\mathbb{E}\Biggl[\biggl\| \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{h}(\mathbf{X})\right)\right)^{\top} X\biggr\|^2  \Bigg|\mathbf{X}\Biggr]\Biggr]\quad \text{(by the law of total expectation)}
\end{aligned}
$$
Let $Z = \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{h}(\mathbf{X})\right)\right)$, where $Z\in \mathbb{R}^d$. Furthermore, let $X = \begin{pmatrix}
    x_1,\ldots,x_d
\end{pmatrix}$, and $Z = \begin{pmatrix}
    z_1,\ldots,z_d
\end{pmatrix}$. \\ By linearity of conditional expectation, we have that,
$$
\begin{aligned}
\mathcal{L}(h) & = \mathbb{E}\left[\mathbb{E}\left[\left(\sum_{i=1}^{d}z_ix_i\right)^2\bigg|\mathbf{X}\right]\right]\\
& = \mathbb{E}\left[\mathbb{E}\left[\sum_{i=1}^{d}z_i^2x_i^2+ 2\; \sum_{1\leq i<j\leq d}z_i z_j x_i x_j\bigg|\mathbf{X}\right]\right] \\
& = \mathbb{E}\left[\mathbb{E}\left[\sum_{i=1}^{d}z_i^2x_i^2\bigg|\mathbf{X}\right]+ 2 \;\mathbb{E}\left[ \sum_{1\leq i<j\leq d}z_i z_j x_i x_j\bigg|\mathbf{X}\right]\right] \\
& = \mathbb{E}\left[\sum_{i=1}^{d}\mathbb{E}\left[z_i^2x_i^2\bigg|\mathbf{X}\right]+ 2 \;\sum_{1\leq i<j\leq d}\mathbb{E}\left[z_i z_j x_i x_j\bigg|\mathbf{X}\right]\right] \\
& = \mathbb{E}\left[\sum_{i=1}^{d}\mathbb{E}\left[z_i^2\big|\mathbf{X}\right] \mathbb{E}\left[x_i^2\right]+ 2 \;\sum_{1\leq i<j\leq d}\mathbb{E}\left[z_i\big|\mathbf{X}\right]\mathbb{E}\left[z_j\big|\mathbf{X}\right]\mathbb{E}\left[x_i\right]\mathbb{E}\left[x_j\right]\right]\\
\end{aligned}
$$
The last equation follows from the following independence conditions: $X\ind \mathbf{X}$, $Z\ind X |\mathbf{X}$, and $z_i \ind z_j |\mathbf{X}$.

We now prove that for every $i \in \{1,\ldots,d\}$, $\mathbb{E}[z_i^2|\mathbf{X}] = 0$ and $\mathbb{E}[z_i|\mathbf{X}] = 0$. \\
Let $\mathbf{A} = \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top}$, where $\mathbf{A} = \begin{pmatrix}
     a_{11} & \ldots & a_{1M}\\
     \vdots & \ddots & \vdots \\
     a_{d1} & \ldots & a_{dM}
\end{pmatrix}$. We have that $z_i = \sum_{k=1}^{M} a_{ik} \left(Y_k - h(X_k)\right)$. 

Therefore, by linearity of the conditional expectation, and since we assumed that $h(X_k) = \mathbb{E}[Y_k|X_k]$, we have $\mathbb{E}[z_i|\mathbf{X}] = 0$. \\
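In more detail, the computation can be sketched as follows, under the additional assumption (stated here only for this illustration) that the pairs $(X_k, Y_k)$ are independent across $k$, so that $\mathbb{E}[Y_k|\mathbf{X}] = \mathbb{E}[Y_k|X_k]$. Since each $a_{ik}$ is $\mathbf{X}$-measurable,
$$
\mathbb{E}[z_i|\mathbf{X}] = \sum_{k=1}^{M} a_{ik}\left(\mathbb{E}[Y_k|\mathbf{X}] - h(X_k)\right) = \sum_{k=1}^{M} a_{ik}\left(\mathbb{E}[Y_k|X_k] - h(X_k)\right) = 0
$$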
Furthermore, we have that,
$$
\begin{aligned}
    \mathbb{E}[z_i^2|\mathbf{X}] 
    & =  \mathbb{E}\left[\left(\sum_{k=1}^{M} a_{ik} \left(Y_k - h(X_k)\right)\right)^2\bigg|\mathbf{X}\right]\\
    & = \mathbb{E}\left[\sum_{k=1}^{M} a_{ik}^2 \left(Y_k - h(X_k)\right)^2\bigg|\mathbf{X}\right]+ 2 \; \mathbb{E}\left[\sum_{1\leq k<l\leq M} a_{ik} a_{il} \left(Y_k - h(X_k)\right) \left(Y_l - h(X_l)\right)\bigg|\mathbf{X}\right]\\
    & = \sum_{k=1}^{M}\mathbb{E}\left[ a_{ik}^2 \bigg|\mathbf{X}\right] \mathbb{E}\left[\left(Y_k - h(X_k)\right)^2\bigg|\mathbf{X}\right]+ 2 \; \sum_{ k<l} \mathbb{E}\left[a_{ik} a_{il} \bigg|\mathbf{X}\right]\mathbb{E}\left[\left(Y_k - h(X_k)\right)\bigg|\mathbf{X}\right]\mathbb{E}\left[\left(Y_l - h(X_l)\right)\bigg|\mathbf{X}\right]\\
    & = 0 
\end{aligned}
$$
We now prove the second implication, assuming that $\mathcal{L}(h) = 0$. Finding the minimum over the space of functions, $\mathcal{H}$, is equivalent to solving for $\mathbf{h}(\mathbf{x})$ for every $\mathbf{x}$. Subsequently, letting $\mathbf{x} \in \mathcal{M}_{M,d}(\mathbb{R})$, we have that,
$$
\mathbf{h}(\mathbf{x}) = \operatorname*{arg\,min}_{\mathbf{c}\in \mathbb{R}^M} \mathbb{E}\Biggl[\biggl\|X^{\top} \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{c}\right)\right) \biggr\|^2  \Bigg|\mathbf{X}=\mathbf{x}\Biggr]
$$
By taking the gradient with respect to $\mathbf{c}$, we have that,
$$
\begin{aligned}  \nabla_{\mathbf{c}}\mathbb{E}\Biggl[\biggl\|X^{\top} \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{c}\right)\right) \biggr\|^2  \Bigg|\mathbf{X}=\mathbf{x}\Biggr] 
& = \mathbb{E}\Biggl[\nabla_{\mathbf{c}}\biggl\|X^{\top} \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - \mathbf{c}\right)\right) \biggr\|^2  \Bigg|\mathbf{X}=\mathbf{x}\Biggr] \\
& = \mathbb{E}\Biggl[\left(X^{\top} \mathbf{A}\left(\mathbf{Y} - \mathbf{c}\right)\right) \left(\left(\mathbf{Y} - \mathbf{c}\right) \odot   \mathbf{A}^{\top}X\right)\Bigg|\mathbf{X}=\mathbf{x}\Biggr] 
\end{aligned}
$$
Consequently, if the gradient with respect to $\mathbf{c}$ is zero, it implies that for every $i \in \{1,\ldots,M\}$, we have that,
$$
\begin{aligned}
    \sum_{j=1}^d \sum_{k=1}^M \sum_{l=1}^d\mathbb{E}\left[x_l x_j a_{jk} a_{li} \left(y_k - c_k\right)\left(y_i-c_i\right)\big |\mathbf{X}=\mathbf{x}\right] = 0
\end{aligned}
$$
where $x_i$ is the $i^{\text{th}}$ component of $X$ and $a_{jk}$ are the elements of the matrix $\mathbf{A}$. By the independence of $X$, and the fact that $a_{jk}$ is $\mathbf{X}$-measurable, a zero gradient implies that,
$$
\begin{aligned}
    \sum_{j=1}^d \sum_{k=1}^M \sum_{l=1}^d\mathbb{E}\left[x_l x_j\right] a_{jk} a_{li} \mathbb{E}\left[\left(y_k - c_k\right)\left(y_i-c_i\right)\big |\mathbf{X}=\mathbf{x}\right] = 0
\end{aligned}
$$
Since the rows of $\mathbf{X}$ are independent and identically distributed and since $M>d$, we have that $\mathbf{X}^{\top}\mathbf{X}$ is full rank and invertible, and hence, $\mathbf{A}$ has full row rank $d$. Furthermore, $\mathbb{E}\left[x_l x_j\right]$ are the elements of the second-moment matrix $\mathbb{E}\left[X X^{\top}\right]$ of $X$ (its covariance matrix when $X$ is centered), which is positive definite. If the gradient is equal to zero, this implies that, for every $i,k \in \{1,\ldots,M\}$,
$$\mathbb{E}\left[\left(y_k - c_k\right)\left(y_i-c_i\right)\big |\mathbf{X}=\mathbf{x}\right] = 0$$
Consequently, for $i=k$,
$$\mathbb{E}\left[\left(y_i-c_i\right)^2\big |\mathbf{X}=\mathbf{x}\right] = 0$$
Hence,
$$
\mathbf{c} = \mathbb{E}\left[\mathbf{Y}\big |\mathbf{X}=\mathbf{x}\right]
$$
Therefore, we see that $\mathcal{L}(h)\geq 0$ with the hypothesis minimizing the loss being $h(x)=\mathbb{E}\left[Y|X=x\right]$ almost surely.
\end{proof}

% \begin{proposition2}
%     % \label{thm:rlp_loss_convex}
%     For a class of hypotheses restricted to linear models (i.e., $h_{a,b}(X) = a^\top X + b$ with $a \in \mathbb{R}^d$ and $b\in \mathbb{R}$), there exists a $\delta > 0$ such that for every $\epsilon < \delta$,
%     $$ \mathcal{L}\left(\theta - \epsilon \nabla_{\theta} \mathcal{L}(\theta)\right) \leqslant \mathcal{L}(\theta)$$
% \end{proposition2}
% \begin{proof}
%     Given the hypothesis:
% \[ h_{\theta}(\mathbf{X}) = a^\top\mathbf{X} + b \]
% where \( \theta = [a^\top, b^\top]^\top \), our RLP loss function is:
% \[ \mathcal{L}(\theta) = \mathbb{E}\left[\left\| \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\left(\mathbf{Y} - a^\top \mathbf{X} - b\right)\right)^{\top} X\right\|^2  \right] \]
% The loss function $\mathcal{L}(\theta)$ is a quadratic function with respect to the parameters $a$ and $b$. Specifically, it can be written in the form:
% $$
% \mathcal{L}(\theta)=\theta^{\top} H \theta+c^{\top} \theta+d
% $$
% where $H$ is the Hessian matrix, $c$ is a vector, and $d$ is a scalar.

% For a quadratic function to be convex, its Hessian matrix, $H$, needs to be positive semidefinite. Given that the matrix $\mathbf{X}^{\top} \mathbf{X}$ is assumed to be full rank (the rows of $\mathbf{X}$ are independent and $M>d$), the Hessian matrix of $\mathcal{L}(\theta)$ with respect to the parameters $a$ and $b$ will be positive semi-definite. Hence, the function $\mathcal{L}(\theta)$ is convex in $a$ and $b$, and as such, there exists a $\delta > 0$ such that for every $\epsilon < \delta$, $\mathcal{L}\left(\theta - \epsilon \nabla_{\theta} \mathcal{L}(\theta)\right) \leqslant \mathcal{L}(\theta)$.

% % Let's compute the gradient with respect to \( \mathbf{a} \) and \( b \):

% % \begin{align*}
% % \nabla_{\mathbf{a}} \mathcal{L}(\theta) &= \mathbb{E}\left[\nabla_{\mathbf{a}} \left\| \left(\mathbf{A}\left(\mathbf{Y} - \mathbf{a}^T\mathbf{X} - b\right)\right)^{\top} X\right\|^2  \right] \\
% % \nabla_b \mathcal{L}(\theta) &= \mathbb{E}\left[\nabla_b \left\| \left(\mathbf{A}\left(\mathbf{Y} - \mathbf{a}^T\mathbf{X} - b\right)\right)^{\top} X\right\|^2  \right]
% % \end{align*}
% % Therefore,
% % \begin{align*}
% % \nabla_{\mathbf{a}} \mathcal{L}(\theta) &= \mathbb{E}\left[ \left(X^{\top}\mathbf{A}\left(\mathbf{Y} - \mathbf{a}^T\mathbf{X} - b\right)\right) \left(X \odot \nabla_{\mathbf{a}} \left(\mathbf{A}\left(\mathbf{Y} - \mathbf{a}^T\mathbf{X} - b\right)\right)\right) \right] \\
% % \nabla_b \mathcal{L}(\theta) &= \mathbb{E}\left[\nabla_b \left\| \left(\mathbf{A}\left(\mathbf{Y} - \mathbf{a}^T\mathbf{X} - b\right)\right)^{\top} X\right\|^2  \right]
% % \end{align*}


% % The Hessian matrix \( H \) is:
% % \[ H = \begin{bmatrix}
% % \nabla^2_{\mathbf{a}} \mathcal{L}(\theta) & \nabla_{\mathbf{a},b} \mathcal{L}(\theta) \\
% % \nabla_{b,\mathbf{a}} \mathcal{L}(\theta) & \nabla^2_b \mathcal{L}(\theta)
% % \end{bmatrix} \]


% \end{proof}
\begin{proposition3}
% \label{thm:better_loss}
Let $L_0$ denote the MSE loss and let $\theta^{*}$ be the optimal parameters (i.e., $h_{\theta^*} = \mathbb{E}\left[Y|X\right]$ almost surely). We assume that both the MSE and RLP loss functions are convex.
Under the following conditions:
\begin{enumerate}[label=(\roman*)]
    \item $\mathbb{E}\left[X_{i} X_{j}\right]= [1,\cdots,1]^{\top}\mathbbm{1}_{i=j}$.
    \item $(\mathbf{Y}-h_{\theta}(\mathbf{X})) \leqslant 0 $ and $\nabla_{\theta} h_{\theta}(\mathbf{X}) \leqslant 0 \quad$ (component-wise inequality).
    \item For every $j,k \in \{1,2,\ldots,d\}$ and for every $l \in \{1,2,\ldots,M\},\; \mathbb{E}\left[\mathbf{a}_{j k} \mathbf{a}_{k l}\right] \geqslant \frac{1}{d^{2}}$, where
    $\left(\mathbf{a}_{jk}\right)$ and $\left(\mathbf{a}_{kl}\right)$ are the components of $\mathbf{A}=\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top}$.
\end{enumerate}
We observe that for every step size $\epsilon \geqslant 0$ and parameter $\theta \in \mathbb{R}^W$ for which gradient descent converges,
$$\; \left\| \theta^{*} - \left(\theta - \epsilon \nabla_{\theta}\mathcal{L}(\theta)\right) \right\| \leqslant \left\|\theta^{*}  - \left(\theta - \epsilon\nabla_{\theta}L_0(\theta)\right) \right\|$$
\end{proposition3}
This proposition contrasts the convergence behavior of the two loss functions, MSE loss and RLP loss, for gradient descent optimization in parameterized models. It asserts that under certain conditions --- (\textit{i}), (\textit{ii}), and (\textit{iii}) from Proposition \ref{thm:better_loss} --- updates based on the gradient of the RLP loss function bring the parameters closer to the optimal solution than those based on the gradient of the MSE loss function.
\begin{proof}
Under the following assumptions:
\textit{(i)} $\mathbb{E}\left[X_{i} X_{j}\right]= [1,\cdots,1]^{\top}\mathbbm{1}_{i=j}$.

\textit{(ii)} $(\mathbf{Y}-h_{\theta}(\mathbf{X})) \leqslant 0 $ and $\nabla_{\theta} h_{\theta}(\mathbf{X}) \leqslant 0 \quad$ (component-wise inequality)

\textit{(iii)} $\mathbb{E}\left[\mathbf{a}_{j k} \mathbf{a}_{l k}\right] \geqslant \frac{1}{d^{2}}$ for all $j,k,l$, where
$\left(\mathbf{a}_{i m}\right)_{1\leq i \leq d, 1\leq m \leq M}$ are the components of the matrix $\mathbf{A}=\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top}$.

We have that,
$$
\left\|\theta^{*}-\theta+\epsilon \nabla_{\theta} \mathcal{L}\left(\theta\right)\right\|_2^{2} =\sum_{i=1}^{W}\left(\theta_{i}^{*} - \theta_i +\epsilon \frac{\partial}{\partial \theta_i} \mathcal{L}\left(\theta\right)\right)^{2}
$$
Letting $1 \leqslant i \leqslant W$, we have that,
$$
\begin{aligned}
\frac{\partial}{\partial \theta_i} \mathcal{L}(\theta) 
& = - 2 \; \mathbb{E}\Bigg[X_{n+1}^{\top} \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} \mathbf{X}^{\top} \left(\mathbf{Y} - h(\mathbf{X})\right)\right) \times X_{n+1}^{\top} \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} \mathbf{X}^{\top} \frac{\partial}{\partial \theta_i}h(\mathbf{X})\right) \Bigg] \\
& = - 2 \; \mathbb{E}\left[X_{n+1}^{2}\right]^{\top} \mathbb{E}\Bigg[ \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} \mathbf{X}^{\top} \left(\mathbf{Y} - h(\mathbf{X})\right)\right) \odot \left(\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} \mathbf{X}^{\top} \frac{\partial}{\partial \theta_i}h(\mathbf{X})\right) \Bigg]
\end{aligned}
$$
where $\odot$ denotes the Hadamard product between two vectors. Note that $ \frac{1}{2} \frac{\partial}{\partial \theta_i} \mathcal{L}(\theta) \leqslant 0$. Subsequently, it follows from assumption \textit{(i)} that,
$$
\begin{aligned}
- \frac{1}{2} \frac{\partial}{\partial \theta_i} \mathcal{L}(\theta) 
& =  \mathbb{E}\left[\left(\mathbf{Y} - h(\mathbf{X})\right)^{\top} \mathbf{A}^{\top} \mathbf{A} \frac{\partial}{\partial \theta_i} h_{\theta}(\mathbf{X})\right] \\
& = \sum_{j=1}^M\sum_{l=1}^M\sum_{k=1}^d \mathbb{E}\left[\left(Y_j - h(X_j)\right)\mathbf{a}_{j k} \mathbf{a}_{lk} \frac{\partial}{\partial \theta_i}h(X_l)\right] \\
& = \sum_{j=1}^M \sum_{l=1}^M \sum_{k=1}^d \mathbb{E}\left[\mathbb{E}\left[\left(Y_j - h(X_j)\right)\mathbf{a}_{j k} \mathbf{a}_{l k} \frac{\partial}{\partial \theta_i}h(X_l) \bigg| X_j,X_l\right]\right] \\
& = \sum_{k=1}^d\sum_{j\neq l}^M \mathbb{E}\left[\left(Y_j - h(X_j)\right)\right] \mathbb{E} \left[\frac{\partial}{\partial \theta_i}h(X_l)\right]\mathbb{E}\left[\mathbf{a}_{j k} \mathbf{a}_{l k} \right] \\
& \;\; + \sum_{k=1}^d\sum_{j=1}^M \mathbb{E}\left[\left(Y_j - h(X_j)\right)\frac{\partial}{\partial \theta_i}h(X_j)\right] \mathbb{E}\left[\mathbf{a}_{j k} \mathbf{a}_{j k} \right] \\
& \geqslant \frac{M}{d} \; \mathbb{E}\left[\left(Y_1 - h(X_1)\right) \frac{\partial}{\partial \theta_i}h(X_1)\right]\\
& \geqslant - \frac{1}{2}\frac{\partial}{\partial \theta_i} L_0(\theta) 
\end{aligned}
$$
This result follows from the application of the tower property, noting that for $j\neq l$, we have that $Y_j \ind Y_l | X_j,X_l$ and $h(X_j) \ind h(X_l) | X_j,X_l$, and by applying assumptions \textit{(ii)} and \textit{(iii)}. Therefore we have that,
$$
\left(\theta_{i}^{*} - \theta_i +\epsilon \frac{\partial}{\partial \theta_i} \mathcal{L}\left(\theta\right)\right)^2 \leqslant \left(\theta_{i}^{*} - \theta_i +\epsilon \frac{\partial}{\partial \theta_i} L_0\left(\theta\right)\right)^2
$$
Accordingly, we observe that $\; \left\| \theta^{*} - \left(\theta - \epsilon \nabla_{\theta}\mathcal{L}(\theta)\right) \right\| \leqslant \left\|\theta^{*}  - \left(\theta - \epsilon\nabla_{\theta}L_0(\theta)\right) \right\|$ for every step size $\epsilon \geqslant 0$ and parameter $\theta \in \mathbb{R}^W$ for which gradient descent converges.
\end{proof}


\section{DATASET DESCRIPTIONS} \label{secDatasets}

\subsection{California Housing Dataset}
The \textbf{California Housing} dataset contains housing data of California derived from the 1990 U.S. census. It is often used for regression predictive modeling tasks. The dataset has:
\begin{itemize}
    \item $|\mathcal{X}| = 20640$ examples, with $|J|$ training examples and $|\mathcal{X}| - |J|$ test examples.
    \item $d = 8$ features: MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, and Longitude.
    \item Target Variable: Median house value for California districts.
\end{itemize}

\subsection{Wine Quality Dataset}
The \textbf{Wine Quality} dataset consists of physicochemical tests and quality scores for red and white vinho verde wine samples from the north of Portugal. The dataset has:
\begin{itemize}
    \item $|\mathcal{X}| = 6497$ examples (combined red and white wine), with $|J|$ training examples and $|\mathcal{X}| - |J|$ test examples.
    \item $d = 11$ features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.
    \item Target Variable: Quality score between 0 and 10.
\end{itemize}

\subsection{Linear Synthetic Dataset}
The \textbf{Linear} dataset is a synthetic dataset, generated with a fixed random seed, \texttt{rng = np.random.RandomState(0)} (in Python). The dataset has:
\begin{itemize}
    \item $|\mathcal{X}| = 6000$ examples, with $|J|$ training examples and $|\mathcal{X}| - |J|$ test examples.
    \item $d = 5$ features: $\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3, \mathcal{X}_4, \mathcal{X}_5 \sim \mathcal{U}[0,1)$, where each feature is uniformly distributed between 0 and 1.
    \item Target Variable: Given by the equation
    \[ \mathcal{Y} = 0.5 \mathcal{X}_1 + 1.5 \mathcal{X}_2 + 2.5 \mathcal{X}_3 + 3.5 \mathcal{X}_4 + 4.5 \mathcal{X}_5 \]
\end{itemize}
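As an illustration, the construction above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the exact generation script: the variable names and the choice to draw all features in a single \texttt{rng.rand} call are assumptions, so the sampled values need not coincide with those used in the experiments.

```python
import numpy as np

# Fixed seed, as stated in the dataset description.
rng = np.random.RandomState(0)

n_examples, d = 6000, 5
# Assumption: all features are drawn in one call; the original sampling order is not specified.
X = rng.rand(n_examples, d)  # each entry is uniform on [0, 1)
coefficients = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
y = X @ coefficients  # noiseless linear target

print(X.shape, y.shape)
```

Since the target is an exact linear function of the features, an ordinary least-squares fit on any batch with at least $d$ linearly independent rows recovers the coefficients.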

\subsection{Nonlinear Synthetic Dataset}
The \textbf{Nonlinear} dataset is a synthetic dataset, generated with a fixed random seed, \texttt{rng = np.random.RandomState(1)} (in Python). The dataset has:
\begin{itemize}
    \item $|\mathcal{X}| = 6000$ examples, with $|J|$ training examples and $|\mathcal{X}| - |J|$ test examples.
    \item $d = 7$ features: $\mathcal{X}_1, \mathcal{X}_2, \mathcal{X}_3, \mathcal{X}_4, \mathcal{X}_5, \mathcal{X}_6, \mathcal{X}_7 \sim \mathcal{U}[0,1)$, where each feature is uniformly distributed between 0 and 1.
    \item Target Variable: Given by the equation
    \[ \mathcal{Y} = \mathcal{X}_1 + \mathcal{X}_2^2 + \mathcal{X}_3^3 + \mathcal{X}_4^4 + \mathcal{X}_5^5 + e^{\mathcal{X}_6} + \sin(\mathcal{X}_7) \]
\end{itemize}
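The nonlinear target can be sketched in the same way. As before, this is only an illustrative sketch: the variable names and the single \texttt{rng.rand} call are assumptions, and the sampled values may differ from those used in the experiments.

```python
import numpy as np

# Fixed seed, as stated in the dataset description.
rng = np.random.RandomState(1)

n_examples, d = 6000, 7
# Assumption: all features are drawn in one call; the original sampling order is not specified.
X = rng.rand(n_examples, d)  # each entry is uniform on [0, 1)

# The first five features are raised to the powers 1, ..., 5 and summed.
powers = np.arange(1, 6)
polynomial_part = (X[:, :5] ** powers).sum(axis=1)
# The last two features enter through exp and sin, respectively.
y = polynomial_part + np.exp(X[:, 5]) + np.sin(X[:, 6])

print(X.shape, y.shape)
```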

\subsection{MNIST Dataset}
The \textbf{MNIST} (Modified National Institute of Standards and Technology) dataset is a collection of handwritten digits commonly used for training image processing systems. While the original MNIST dataset consists of $60000$ training and $10000$ test examples, we consider a smaller version of the dataset (randomly partitioned from the original training and test datasets) that has:
\begin{itemize}
    \item $|\mathcal{X}| = 10000$ examples, with $|J|$ training examples (from the MNIST training examples) and $|\mathcal{X}| - |J|$ test examples (from the MNIST test examples).
    \item Each example (image) is of size \(28 \times 28\) pixels, with each pixel represented as a grayscale intensity from 0 to 255.
    \item Target Variable: The actual digit the image represents, ranging from 0 to 9.
\end{itemize}

\subsection{CIFAR-10 Dataset}
The \textbf{CIFAR-10} dataset comprises color images categorized into 10 different classes, representing various objects and animals such as airplanes, cars, and birds. The images cover a broad range of scenarios, making the dataset highly versatile for various computer vision tasks. While the original CIFAR-10 dataset consists of $50000$ training and $10000$ test examples, we consider a smaller version of the dataset (randomly partitioned from the original training and test datasets) that has:

\begin{itemize}
    \item $|\mathcal{X}| = 10000$ examples, with $|J|$ training examples (from the CIFAR-10 training examples) and $|\mathcal{X}| - |J|$ test examples (from the CIFAR-10 test examples).
    \item Each example (image) is of size \(32 \times 32 \times 3\), that is, \(32 \times 32\) pixels for each of the three color channels (Red, Green, Blue), with each channel value represented as an intensity from 0 to 255.
    \item Target Variable: The class label of the image.
\end{itemize}


\section{ADDITIONAL EXPERIMENTS} \label{secAdditionalExperiments}

\subsection{Classification Tasks} \label{secClassification}
While the RLP loss was introduced in the scope of regression and reconstruction tasks, we note that the loss can also be applied to classification tasks. We provide a motivation for using the RLP loss for classification in Figure~\ref{fig:rlp_illustration_classification}. Paralleling the regression case, we note that if two discontinuous functions with discrete images share the same hyperplanes connecting all subsets of their feature-label pairs, then they must necessarily be equivalent.

%intuition figure
\begin{figure}[h!]
    \centering
    \includegraphics[width=8cm]{figures/figure_rlp_classification.pdf}
    \caption{Comparison of true and predicted functions --- a demonstration that two discontinuous functions with discrete images are equivalent if and only if they share identical hyperplanes generated by all possible feature-label pairs.}
    \label{fig:rlp_illustration_classification}
\end{figure}
%As an example, we consider the following figure:

Accordingly, we observe that minimizing the RLP loss (and achieving zero loss) ensures that we learn the true (discontinuous) function, which is supported by our theoretical findings in Section \ref{secTheoretical}. Our empirical results on the Moons dataset (\texttt{sklearn.datasets.make\_moons} in Python) and MNIST indicate that the RLP loss offers faster convergence and improved accuracy and F$_1$-score. Additionally, we employ mixup~\citep{zhang2017mixup} and compare the RLP loss against the cross-entropy loss when both are combined with mixup data augmentation (we further investigate mixup data augmentation for regression in Section \ref{secRegReconstruct}). The results are illustrated in Figures~\ref{fig:moons} and~\ref{fig:mnist_classifcation}.
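For readers unfamiliar with mixup, the augmentation step for classification can be sketched as follows. This is a minimal NumPy sketch of a common mixup variant (pairing each example with a randomly permuted partner from the same batch), not necessarily the exact procedure used in our experiments; the function name \texttt{mixup\_batch}, the toy data, and the value of the Beta shape parameter are hypothetical.

```python
import numpy as np

def mixup_batch(x, y_onehot, psi, rng):
    """Return a mixup-augmented batch using a single lambda ~ Beta(psi, psi)."""
    lam = rng.beta(psi, psi)
    perm = rng.permutation(len(x))  # partner example for each row
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

rng = np.random.RandomState(0)
x = rng.rand(8, 2)                  # hypothetical 2-D features
labels = rng.randint(0, 2, size=8)  # hypothetical binary labels
y_onehot = np.eye(2)[labels]        # one-hot encoding of the labels

x_mix, y_mix, lam = mixup_batch(x, y_onehot, psi=0.4, rng=rng)
print(x_mix.shape, y_mix.shape)
```

Each mixed label is a convex combination of two one-hot vectors, so its entries remain non-negative and sum to one.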

% Training/Test on Half of Dataset
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MOONS_Training_Size_500.png}
        \caption{Moons}
        \label{moons}
    \end{subfigure}
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MOONS_Training_Size_500_MIXUP.png}
        \caption{Moons with Mixup}
        \label{moons_mixup}
    \end{subfigure}
\\
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MOONS_Training_Size_25.png}
        \caption{Moons}
        \label{moons_small}
    \end{subfigure}
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MOONS_Training_Size_25_MIXUP.png}
        \caption{Moons with Mixup}
        \label{moons_small_mixup}
    \end{subfigure}
    \caption{Comparing performance between Cross Entropy Loss and Random Linear Projections Loss for a classification task on the Moons dataset in terms of accuracy and F$_{1}$-score. Figure~\ref{moons} showcases results with $|J| = 900$ training and $|\mathcal{X}| - |J| = 100$ test examples ($|\mathcal{X}| = 1000$). Figure~\ref{moons_mixup} uses the same data split but is augmented with mixup. Figure~\ref{moons_small} employs a smaller set of $|J| = 25$ training examples and $|\mathcal{X}| - |J| = 475$ test examples ($|\mathcal{X}| = 500$), while Figure~\ref{moons_small_mixup} integrates the mixup data augmentation method on this smaller dataset. Both loss functions are evaluated across all scenarios.}
    \label{fig:moons}
\end{figure*}

% Training/Test on Half of Dataset
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MNIST_Training_Size_5000.png}
        \caption{MNIST}
        \label{mnist_old}
    \end{subfigure}
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MNIST_Training_Size_5000_MIXUP.png}
        \caption{MNIST with Mixup}
        \label{mnist_mixup}
    \end{subfigure}
\\
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MNIST_Training_Size_100.png}
        \caption{MNIST}
        \label{mnist_small}
    \end{subfigure}
    \begin{subfigure}{0.45\textwidth}
        \includegraphics[width=\textwidth]{figures/MNIST_Training_Size_100_Mixup.png}
        \caption{MNIST with Mixup}
        \label{mnist_small_mixup}
    \end{subfigure}
    \caption{Performance comparison between Cross Entropy Loss and Random Linear Projections Loss for a classification task on MNIST, evaluated in terms of accuracy and F$_{1}$-score. Figure~\ref{mnist_old} showcases results with $|J| = 5000$ training and $|\mathcal{X}| - |J| = 5000$ test examples ($|\mathcal{X}| = 10000$). Figure~\ref{mnist_mixup} uses the same data split but is augmented with the mixup method. Figure~\ref{mnist_small} employs a smaller set of $|J| = 100$ training and $|\mathcal{X}| - |J| = 1000$ test examples, while Figure~\ref{mnist_small_mixup} integrates the mixup data augmentation method on this smaller dataset. Both loss functions are evaluated across all scenarios.}
    \label{fig:mnist_classifcation}
\end{figure*}

\subsection{Regression and Reconstruction Tasks} \label{secRegReconstruct}
We now provide additional empirical results pertaining to the regression and reconstruction tasks outlined in the main text. As an extension, we compare RLP loss with \textbf{(1)} mixup-augmented MSE loss (MSE loss + Mixup) and \textbf{(2)} mixup-augmented RLP loss (RLP loss + Mixup). Regarding \textbf{(1)}, we use MSE loss to train the neural network on the virtual training examples produced by mixup, whereas in \textbf{(2)}, we use RLP loss to train the neural network on the virtual training examples formed using convex combinations between two unique pairs of sets of hyperplanes connecting fixed-size subsets of the neural network's feature-prediction pairs and feature-label pairs (see Algorithm \ref{alg:rlpmixup}).

\begin{algorithm}[h]
\caption{Neural Network Training With Mixup-Augmented RLP Loss}
\SetKwInput{KwInput}{Input}
\SetKwInput{KwOutput}{Output}
\label{alg:rlpmixup}
\DontPrintSemicolon

\KwInput{$J$ (Training dataset), $\theta$ (Initial NN parameters), $\alpha$ (Learning rate), $M$ (Batch size), \linebreak $K$ (Number of batches to generate), $E$ (Number of epochs), $\psi$ (Beta distribution shape parameter)}
\KwOutput{$\theta$ (Trained NN parameters)}

% Initialization
$\mathcal{B}_a \gets$ \texttt{Balanced\_Batch\_Generator}($J$, $M$, $K$) \\
$\mathcal{B}_b \gets$ \texttt{Balanced\_Batch\_Generator}($J$, $M$, $K$) 
%\tcp*{Generate batches}

\For{$\text{epoch} = 1,2,\ldots,E$}{
    \For{$j = 1,2,\ldots,K$}{
        $\textbf{x}_a, \textbf{x}_b \gets$ Matrix of features from batches $\mathcal{B}_a[j]$ and $\mathcal{B}_b[j]$, respectively \\ 
        $\textbf{y}_a, \textbf{y}_b \gets$ Vector of labels from batches $\mathcal{B}_a[j]$ and $\mathcal{B}_b[j]$, respectively \\
        \If{\upshape $\text{size}(\textbf{x}_a) \neq \text{size}(\textbf{x}_b)$}{
            \textbf{continue}
        }
        % Form convex combination
        $\lambda \gets \text{Beta}(\psi,\psi)$ (Randomly sample from Beta distribution) \\
        $\textbf{x}_j \gets (\lambda)\textbf{x}_a + (1 - \lambda)\textbf{x}_b$ \\
        $\textbf{y}_j \gets (\lambda)\textbf{y}_a + (1 - \lambda)\textbf{y}_b$ \\
        % compute the observed regression matrix
        $M_y \gets \left(\textbf{x}_j^{\top}\textbf{x}_j\right)^{-1} \textbf{x}_j^{\top} \textbf{y}_j$\\
        $M_h \gets \left(\textbf{x}_j^{\top}\textbf{x}_j\right)^{-1} \textbf{x}_j^{\top} \textbf{h}_{\theta}(\textbf{x}_j)$\\
        % Compute RLP Loss
        $l_j(\theta) \gets  \left( \sum_{k=1}^M \left(M_y - M_h\right)^{\top} x_{j_k} \right )^2$ \\ %\tcp*{Compute RLP loss}
        }
        % Gradient Descent Update
    $\mathcal{L}(\theta) \gets \frac{1}{K} \sum_{j=1}^{K} l_j(\theta)$ \\
    $\theta \gets \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta)$ %\tcp*{Update parameters}    
}
\Return{$\theta$} %\tcp*{Return trained parameters}
\end{algorithm}
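The per-batch computation in Algorithm \ref{alg:rlpmixup} can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: a fixed linear map stands in for the neural network $h_{\theta}$, no gradient step is performed, least-squares coefficients are obtained with \texttt{np.linalg.lstsq} rather than an explicit matrix inverse, and the function name, batch size, dimension, and Beta shape parameter are hypothetical.

```python
import numpy as np

def mixup_rlp_batch_loss(x_a, y_a, x_b, y_b, h, lam):
    """Loss l_j for one pair of equally sized batches, following the algorithm's steps."""
    # Convex combination of the two batches (mixup).
    x_j = lam * x_a + (1.0 - lam) * x_b
    y_j = lam * y_a + (1.0 - lam) * y_b
    # Regression coefficients for the labels and for the predictions.
    m_y, *_ = np.linalg.lstsq(x_j, y_j, rcond=None)
    m_h, *_ = np.linalg.lstsq(x_j, h(x_j), rcond=None)
    # Project the coefficient difference onto the batch features, sum, and square.
    return float(np.sum(x_j @ (m_y - m_h)) ** 2)

rng = np.random.RandomState(0)
M, d, psi = 16, 3, 0.2              # hypothetical batch size, dimension, Beta shape
w_true = np.array([1.0, 2.0, 0.5])  # hypothetical ground-truth linear map
x_a, x_b = rng.rand(M, d), rng.rand(M, d)
y_a, y_b = x_a @ w_true, x_b @ w_true
lam = rng.beta(psi, psi)

# A hypothesis equal to the true map versus one that always predicts zero.
loss_exact = mixup_rlp_batch_loss(x_a, y_a, x_b, y_b, lambda x: x @ w_true, lam)
loss_wrong = mixup_rlp_batch_loss(x_a, y_a, x_b, y_b, lambda x: x @ np.zeros(d), lam)
print(loss_exact, loss_wrong)
```

In this toy setting the labels are an exact linear function of the features, so the mixed labels agree with the true map evaluated on the mixed features, and the loss vanishes (up to floating-point error) for the correct hypothesis.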

\subsubsection{Performance Analysis}
Extending the first evaluation from the main text, we evaluate the efficacy of RLP loss compared to the mixup-augmented MSE loss and mixup-augmented RLP loss, when there are no ablations introduced within the data. We observe that across all three datasets, neural networks trained with RLP loss and mixup-augmented RLP loss achieve improved performance when compared to those trained with mixup-augmented MSE loss. The results are illustrated in Figure \ref{fig:MIXUP_epochs}.

% Training on Full Dataset (Mixup)
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Epochs_MIXUP.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Epochs_MIXUP.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Epochs_MIXUP.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: mixup-augmented MSE, RLP, and mixup-augmented RLP. The x-axis represents training epochs, while the y-axis indicates the test MSE.}
    \label{fig:MIXUP_epochs}
\end{figure*}

We also compare the elapsed training time (in seconds) using RLP loss versus MSE loss on the California Housing, Wine Quality, and Nonlinear datasets. These results were obtained on an Intel Xeon CPU @ 2.20GHz and are depicted in Figure \ref{fig:rlp_elapsed_time}. For the RLP loss case, we train the neural network using $K = 2000$ batches.

% Training on Full Dataset (Mixup)
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Elapsed_Time_MSE.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Elapsed_Time_MSE.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Elapsed_Time_MSE.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using MSE loss and RLP loss. The x-axis represents elapsed time in seconds, while the y-axis indicates the test MSE.}
    \label{fig:rlp_elapsed_time}
\end{figure*}

\subsubsection{Ablation Study --- Number of Training Examples}
For the ablation study pertaining to the number of training examples, $|J|$, we first consider the case where $|J| = 100$ training examples. For this case, we train the regression neural network using MSE loss, MSE loss with $L_2$ regularization, or RLP loss. As in the $|J| = 50$ case from the main text, we observe that the RLP loss-trained models consistently outperform their MSE loss and $L_2$ regularized MSE loss-trained counterparts in both convergence rate and test error, despite the limited data. These results are illustrated in Figure~\ref{fig:Regular_trainsize_100}.

%  Number of Epochs
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Training_Size_100.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Training_Size_100.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Training_Size_100.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Limited training data ($|J| = 100$) test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: MSE, MSE with $L_2$ regularization (MSE + $L_2$), and RLP. The x-axis represents training epochs, while the y-axis indicates the test MSE.}
    \label{fig:Regular_trainsize_100}
\end{figure*}

We also consider the case where $|J| = 100$ for the image reconstruction task. We observe that across both CIFAR-10 and MNIST, neural networks trained with RLP loss achieve improved performance when compared to those trained with MSE loss and $L_2$ regularized MSE loss. These results are illustrated in Figure \ref{fig:reconstruction_trainsize_100}.

%  Number of Epochs
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/cifar10_100.png}
        \caption{CIFAR-10}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/mnist_100.png}
        \caption{MNIST}
    \end{subfigure}
    \caption{Limited training data ($|J| = 100$) test performance comparison across two datasets (CIFAR-10 and MNIST) using three different loss functions: MSE, MSE with $L_2$ regularization, and RLP. The x-axis represents training epochs, while the y-axis indicates the test MSE.}
    \label{fig:reconstruction_trainsize_100}
\end{figure*}

Subsequently, we extend this study by evaluating the efficacy of RLP loss compared to the mixup-augmented MSE loss and mixup-augmented RLP loss when $|J| \in \{50,100\}$. We see that across all three datasets (California Housing, Wine Quality, and Nonlinear), neural networks trained with RLP loss and mixup-augmented RLP loss achieve improved performance when compared to those trained with mixup-augmented MSE loss. These results are illustrated in Figures \ref{fig:MIXUP_trainsize_50} and \ref{fig:MIXUP_trainsize_100}.

%  Number of Epochs
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Training_Size_50_MIXUP.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Training_Size_50_MIXUP.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Training_Size_50_MIXUP.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Limited training data ($|J| = 50$) test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: mixup-augmented MSE, RLP, and mixup-augmented RLP. The x-axis represents training epochs, while the y-axis indicates the test MSE.}
    \label{fig:MIXUP_trainsize_50}
\end{figure*}

%  Number of Epochs
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Training_Size_100_MIXUP.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Training_Size_100_MIXUP.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Training_Size_100_MIXUP.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Limited training data ($|J| = 100$) test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: mixup-augmented MSE, RLP, and mixup-augmented RLP. The x-axis represents training epochs, while the y-axis indicates the test MSE.}
    \label{fig:MIXUP_trainsize_100}
\end{figure*}


\subsubsection{Ablation Study --- Distribution Shift Bias}
We extend the distribution shift bias ablation study by evaluating the efficacy of RLP loss compared to the mixup-augmented MSE loss and mixup-augmented RLP loss for bias parameter $\gamma \in \{0.1,0.2,\ldots,0.9\}$. We observe that across all three datasets, neural networks trained with RLP loss and mixup-augmented RLP loss achieve improved performance when compared to those trained with mixup-augmented MSE loss. This result is illustrated in Figure \ref{fig:MIXUP_distribution_bias}.

%  Distribution Shift Bias
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Shift_MIXUP.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Shift_MIXUP.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Shift_MIXUP.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Distribution shift test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: mixup-augmented MSE, RLP, and mixup-augmented RLP. The x-axis is the degree of bias, $\gamma$, between the test data and the training data, while the y-axis indicates the test MSE.}
    \label{fig:MIXUP_distribution_bias}
\end{figure*}

\subsubsection{Ablation Study --- Noise Scaling Factor}
We extend the noise scaling factor ablation study by evaluating the efficacy of RLP loss compared to the mixup-augmented MSE loss and mixup-augmented RLP loss for standard normal noise scaling factor $\beta \in \{0.1,0.2,\ldots,0.9\}$. We observe that across all three datasets, neural networks trained with RLP loss and mixup-augmented RLP loss achieve improved performance when compared to those trained with mixup-augmented MSE loss. This result is illustrated in Figure \ref{fig:MIXUP_additive_noise}.

%  Additive Noise
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Noise_MIXUP.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Noise_MIXUP.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Noise_MIXUP.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Noise robustness test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using three different loss functions: mixup-augmented MSE, RLP, and mixup-augmented RLP. The x-axis is the scaling factor, $\beta$, for the additive standard normal noise, while the y-axis indicates the test MSE.}
    \label{fig:MIXUP_additive_noise}
\end{figure*}

\subsubsection{RLP Loss vs. Mean Absolute Error (MAE) Loss}
Mean Absolute Error (MAE) loss is widely utilized in machine learning for its simplicity and interpretability, particularly in regression tasks. Its effectiveness is underscored by research demonstrating its superiority in vector-to-vector regression and in enhancing neural network training with noisy labels, showcasing its adaptability across various applications \citep{Qi2020MAE, Zhang2018GeneralizedCEL}. Paralleling the first evaluation from the main text, we evaluate the efficacy of RLP loss compared to MAE loss, when there are no ablations introduced within the data, for $|J| = 0.5|\mathcal{X}|$ training examples and $|\mathcal{X}| - |J|$ test examples. We observe that across all three datasets, neural networks trained with RLP loss achieve improved performance when compared to those trained with MAE loss. The results are illustrated in Figure \ref{fig:MAE_epochs}.
% Training on Full Dataset (MAE)
\begin{figure*}[h!]
    \centering
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/California_Housing_vs_Epochs_MAE.png}
        \caption{California Housing}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Wine_Quality_vs_Epochs_MAE.png}
        \caption{Wine Quality}
    \end{subfigure}
    \begin{subfigure}{0.33\textwidth}
        \includegraphics[width=\textwidth]{figures/Nonlinear_vs_Epochs_MAE.png}
        \caption{Nonlinear}
    \end{subfigure}
    \caption{Test performance comparison across three datasets (California Housing, Wine Quality, and Nonlinear) using two different loss functions: RLP and MAE. The x-axis represents training epochs, while the y-axis indicates the test MAE.}
    \label{fig:MAE_epochs}
\end{figure*}

\section{EXPERIMENT DETAILS}

\subsection{Neural Network Architectures} \label{secNN_Architectures}
We first provide a detailed description of the four neural network architectures used for regression, image reconstruction, and classification. Each architecture was employed to generate the empirical results for its respective task.

\subsubsection{Regression Neural Network}
The Regression Neural Network utilized in our analysis is designed for regression tasks, mapping input features to continuous output values (see Figure \ref{fig:NN_architectures}). The architecture comprises the following layers:

\begin{itemize}
    \item \textbf{Fully Connected Layer (\texttt{fc1})}: Transforms the input features to a higher dimensional space. It takes $d$-dimensional inputs and yields $32$-dimensional outputs.
    \item \textbf{ReLU Activation (\texttt{relu1})}: Introduces non-linearity to the model. It operates element-wise on the output of \texttt{fc1}.
    \item \textbf{Fully Connected Layer (\texttt{fc2})}: Takes $32$-dimensional inputs and yields $1$-dimensional outputs (final predictions).
\end{itemize}
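
The layer list above corresponds to the following forward map. This is a sketch that assumes both fully connected layers include bias terms (the default in most deep learning frameworks); the symbols $W_1$, $b_1$, $W_2$, and $b_2$ are introduced here for illustration.
\begin{align*}
f(x) &= W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2, \\
W_1 &\in \mathbb{R}^{32 \times d},\quad b_1 \in \mathbb{R}^{32}, \\
W_2 &\in \mathbb{R}^{1 \times 32},\quad b_2 \in \mathbb{R}.
\end{align*}
Under this assumption, the network has $32d + 32 + 32 + 1 = 32d + 65$ trainable parameters; for example, $d = 8$ input features would give $321$ parameters.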

\subsubsection{Autoencoders for Image Reconstruction}
The Autoencoder utilized in our analysis is tailored for image reconstruction tasks (see Figure \ref{fig:NN_architectures}). The architecture consists of two main parts: an encoder and a decoder. As a preprocessing step, all images are flattened to $d$-dimensional inputs and their pixel values are normalized to the range $[0, 1]$.
\begin{itemize}
    \item \textbf{Encoder}:
    \begin{itemize}
        \item \textbf{Fully Connected Layer (\texttt{fc1})}: Encodes the flattened, $d$-dimensional input into a latent representation of size $32$.
        \item \textbf{ReLU Activation (\texttt{relu1})}: Introduces non-linearity to the encoding process.
    \end{itemize}
    
    \item \textbf{Decoder}:
    \begin{itemize}
        \item \textbf{Fully Connected Layer (\texttt{fc2})}: Transforms the $32$-dimensional latent representation into a $d$-dimensional output.
        \item \textbf{Sigmoid Activation (\texttt{sig1})}: Ensures the output values are in the range $[0, 1]$ (akin to normalized pixel values).
    \end{itemize}
\end{itemize}
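
The encoder and decoder above compose into the following reconstruction map. This is a sketch that assumes both fully connected layers include bias terms; the symbols $W_1$, $b_1$, $W_2$, $b_2$, and $\hat{x}$ are introduced here for illustration, and $\sigma(z) = 1/(1 + e^{-z})$ is applied element-wise.
\begin{align*}
\hat{x} &= \sigma\bigl(W_2\,\mathrm{ReLU}(W_1 x + b_1) + b_2\bigr), \\
W_1 &\in \mathbb{R}^{32 \times d},\quad b_1 \in \mathbb{R}^{32}, \\
W_2 &\in \mathbb{R}^{d \times 32},\quad b_2 \in \mathbb{R}^{d}.
\end{align*}
Under this assumption, the autoencoder has $(32d + 32) + (32d + d) = 65d + 32$ trainable parameters. For a flattened $28 \times 28$ grayscale image ($d = 784$), this would amount to $50{,}992$ parameters and a compression factor of $784/32 = 24.5$ in the latent layer.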

\begin{figure}[h!]
    \centering
    \begin{subfigure}{0.4\textwidth}
        \includegraphics[width=\textwidth]{figures/Regression_Neural_Network.png}
        \caption{Regression Neural Network}
    \end{subfigure}
    \begin{subfigure}{0.4\textwidth}
        \includegraphics[width=\textwidth]{figures/Autoencoder.png}
        \caption{Autoencoder}
    \end{subfigure}
    \caption{Architectures of the regression neural network (left) and autoencoder (right).}
    \label{fig:NN_architectures}
\end{figure}

\subsubsection{LeNet-5 for Image Classification on MNIST}
LeNet-5 \citep{lecun1998lenet}, a convolutional neural network, is widely used for image classification tasks such as handwritten digit recognition (e.g., MNIST). For our MNIST image classification study, we first zero-pad the images to size $32 \times 32$. The employed architecture of LeNet-5 (see Figure \ref{fig:lenet5}) consists of the following layers:

\begin{itemize}
    \item \textbf{Convolutional Layer \texttt{(conv1)}}: Applies $6$ filters of size \(5 \times 5\) to the input image.
    \item \textbf{Tanh Activation \texttt{(tanh1)}}: Applies the hyperbolic tangent activation function element-wise.
    \item \textbf{Average Pooling Layer \texttt{(pool1)}}: Down-samples the feature map by a factor of $2$.

    \item \textbf{Convolutional Layer \texttt{(conv2)}}: Applies $16$ filters of size \(5 \times 5\).
    \item \textbf{Tanh Activation \texttt{(tanh2)}}: Applies the hyperbolic tangent activation function element-wise.
    \item \textbf{Average Pooling Layer \texttt{(pool2)}}: Further down-samples the feature map by a factor of $2$.
    
    \item \textbf{Flattening Layer \texttt{(flatten1)}}: Transforms the $2$-dimensional feature map into a flat vector.
    
    \item \textbf{Fully Connected Layer \texttt{(fc1)}}: Transforms the flat vector to a $120$-dimensional space.
    \item \textbf{Tanh Activation \texttt{(tanh3)}}: Applies the hyperbolic tangent activation function element-wise.
    
    \item \textbf{Fully Connected Layer \texttt{(fc2)}}: Reduces the dimensionality to $84$.
    \item \textbf{Tanh Activation \texttt{(tanh4)}}: Applies the hyperbolic tangent activation function element-wise.
    
    \item \textbf{Fully Connected Layer \texttt{(fc3)}}: Produces the final classification output with $10$ dimensions.
\end{itemize}
The above architecture is considered when we use cross-entropy loss for image classification on MNIST. However, when RLP loss is employed, we include a sigmoid activation layer, \texttt{sig1}, that follows the last fully connected layer, \texttt{fc3}:
\begin{itemize}
    \item \textbf{Sigmoid Activation (\texttt{sig1})}: Ensures the output values are in the range $[0, 1]$ (probabilistic classification).
\end{itemize}
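
The layer list above implies the following sequence of tensor shapes, written as channels $\times$ height $\times$ width. This is a sketch that assumes stride-$1$ convolutions with no padding beyond the initial zero-padding to $32 \times 32$, and $2 \times 2$ average pooling with stride $2$; these settings follow the classical LeNet-5 design and are not stated explicitly above. Under these assumptions, a convolution with kernel size $k$ maps a spatial size $n$ to $n - k + 1$.
\begin{align*}
1 \times 32 \times 32
&\xrightarrow{\texttt{conv1}} 6 \times 28 \times 28
\xrightarrow{\texttt{pool1}} 6 \times 14 \times 14 \\
&\xrightarrow{\texttt{conv2}} 16 \times 10 \times 10
\xrightarrow{\texttt{pool2}} 16 \times 5 \times 5 \\
&\xrightarrow{\texttt{flatten1}} 400
\xrightarrow{\texttt{fc1}} 120
\xrightarrow{\texttt{fc2}} 84
\xrightarrow{\texttt{fc3}} 10.
\end{align*}
The activation layers act element-wise and leave these shapes unchanged, so \texttt{fc1} receives a vector of dimension $16 \cdot 5 \cdot 5 = 400$.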

\begin{figure}[h]
\centering
\includegraphics[width=0.65\linewidth]{figures/CNN_MNIST.png}
\caption{Architecture of LeNet-5 for image classification on MNIST.}
\label{fig:lenet5}
\end{figure}

\subsubsection{MoonsClassifier for Classification on the Moons Dataset}
The \textit{MoonsClassifier} is a neural network designed for classifying examples from the Moons dataset, which consists of $2$-dimensional points forming two interleaved half-circle shapes. The architecture of \textit{MoonsClassifier} (see Figure \ref{fig:moonclassifier}) consists of three fully connected layers and two ReLU activation functions, as detailed below:

\begin{itemize}
    \item \textbf{Fully Connected Layer \texttt{(fc1)}}: Transforms the $2$-dimensional input to a $50$-dimensional space. The input features represent the coordinates of a point in the dataset.
    \item \textbf{ReLU Activation \texttt{(relu1)}}: Applies the ReLU activation function element-wise, introducing non-linearity.
    
    \item \textbf{Fully Connected Layer \texttt{(fc2)}}: Further transforms the data in the $50$-dimensional space.
    \item \textbf{ReLU Activation \texttt{(relu2)}}: Applies the ReLU activation function element-wise.
    
    \item \textbf{Fully Connected Layer \texttt{(fc3)}}: Reduces the dimensionality from $50$ to $2$, producing the final classification output.
\end{itemize}
The above architecture is considered when we use cross-entropy loss for classification on the Moons dataset. However, when RLP loss is employed, we include a sigmoid activation layer, \texttt{sig1}, that follows the last fully connected layer, \texttt{fc3}:
\begin{itemize}
    \item \textbf{Sigmoid Activation (\texttt{sig1})}: Ensures the output values are in the range $[0, 1]$ (probabilistic classification).
\end{itemize}
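
The layers above compose into the following map. This is a sketch that assumes \texttt{fc2} maps $50$ dimensions to $50$ dimensions and that all fully connected layers include bias terms; the symbols $W_k$ and $b_k$ are introduced here for illustration.
\begin{align*}
h_1 &= \mathrm{ReLU}(W_1 x + b_1), & W_1 &\in \mathbb{R}^{50 \times 2}, \\
h_2 &= \mathrm{ReLU}(W_2 h_1 + b_2), & W_2 &\in \mathbb{R}^{50 \times 50}, \\
f(x) &= W_3 h_2 + b_3, & W_3 &\in \mathbb{R}^{2 \times 50}.
\end{align*}
Under these assumptions, the classifier has $(2 \cdot 50 + 50) + (50 \cdot 50 + 50) + (50 \cdot 2 + 2) = 150 + 2550 + 102 = 2802$ trainable parameters. When RLP loss is employed, the output becomes $\sigma(f(x))$, with the sigmoid applied element-wise.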

\begin{figure}[h]
\centering
\includegraphics[width=0.55\linewidth]{figures/NN_MOON.png}
\caption{Architecture of \textit{MoonsClassifier} for classification on the Moons dataset.}
\label{fig:moonclassifier}
\end{figure}

\clearpage
\subsection{Neural Network Training Hyperparameters}
The relevant hyperparameters used to train the regression neural networks, autoencoders, and classifiers outlined in Section \ref{secNN_Architectures} are provided in Tables \ref{hyperparameters-table-regression} and \ref{hyperparameters-table-classification}. All results presented in the main text and in Sections \ref{secClassification} and \ref{secRegReconstruct} of the appendix were produced using these hyperparameter choices.
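
For context on the shape parameter $\psi$ reported in the tables, a sketch of the standard mixup construction is given below. It assumes that $\psi$ parameterizes a symmetric Beta distribution, as in the usual mixup formulation; the symbols $\lambda$, $(x_i, y_i)$, $(x_j, y_j)$, $\tilde{x}$, and $\tilde{y}$ are introduced here for illustration and may differ from the notation in the main text.
\begin{align*}
\lambda &\sim \mathrm{Beta}(\psi, \psi), \\
\tilde{x} &= \lambda x_i + (1 - \lambda) x_j, \\
\tilde{y} &= \lambda y_i + (1 - \lambda) y_j.
\end{align*}
Under this assumption, values of $\psi$ below $1$, such as $\psi = 0.25$, concentrate $\lambda$ near $0$ and $1$, so most virtual examples stay close to one of the two original examples.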

\begin{table*}[h]
\caption{Regression and reconstruction tasks --- neural network training hyperparameters (grouped by dataset).}
\label{hyperparameters-table-regression}
\begin{center}
\begin{tabular}{c|c|c|c|c|c}
\hline 
\textbf{Dataset} & \textbf{Experiment} & \textbf{Optimizer} & \thead{\textbf{Learning} \\ \textbf{Rate} ($\boldsymbol{\alpha}$)} & \thead{\textbf{Weight Decay \footnotemark} \\ (MSE loss + $L_2$)} & \thead{\textbf{Shape Parameter} ($\boldsymbol{\psi}$) \\ (Mixup \& RLP + Mixup)} \\
\hline
California Housing & No ablations & Adam & 0.0001 & 0.0001 & 0.25 \\
California Housing & \small $|J| \in \{50,100\}$ & AdamW & 0.0005 & 0.01 & 0.25 \\
California Housing & Distribution shift & Adam & 0.0001 & 0.0001 & 0.25 \\
California Housing & Additive noise & Adam & 0.0001 & 0.0001 & 0.25 \\
California Housing & RLP vs. MAE & Adam & 0.0005 & --- & --- \\
California Housing & Training time & Adam & 1.0\text{e-}6 & --- & --- \\
\hline
Wine Quality & No ablations & Adam & 0.0001 & 0.0001 & 0.25 \\
Wine Quality & \small $|J| \in \{50,100\}$ & AdamW & 0.005 & 0.01 & 0.25 \\
Wine Quality & Distribution shift & Adam & 0.0001 & 0.0001 & 0.25 \\
Wine Quality & Additive noise & Adam & 0.0001 & 0.0001 & 0.25 \\
Wine Quality & RLP vs. MAE & Adam & 0.0005 & --- & --- \\
Wine Quality & Training time & Adam & 1.0\text{e-}6 & --- & --- \\
\hline
Linear & No ablations & Adam & 0.0001 & 0.0001 & --- \\
Linear & \small $|J| \in \{50,100\}$ & AdamW & 0.0005 & 0.01 & --- \\
\hline
Nonlinear & No ablations & Adam & 0.0001 & 0.0001 & 0.25 \\
Nonlinear & \small $|J| \in \{50,100\}$ & AdamW & 0.0005 & 0.01 & 0.25 \\
Nonlinear & Distribution shift & Adam & 0.0001 & 0.0001 & 0.25 \\
Nonlinear & Additive noise & Adam & 0.0001 & 0.0001 & 0.25 \\
Nonlinear & RLP vs. MAE & Adam & 0.0005 & --- & --- \\
Nonlinear & Training time & Adam & 1.0\text{e-}6 & --- & --- \\
\hline
MNIST & No ablations & SGD & 0.01 & 0.0001 & --- \\
MNIST & \small $|J| \in \{50,100\}$ & SGD & 0.01 & 0.0001 & --- \\
\hline
CIFAR-10 & No ablations & SGD & 0.01 & 0.0001 & --- \\
CIFAR-10 & \small $|J| \in \{50,100\}$ & SGD & 0.01 & 0.0001 & --- \\
\hline
\end{tabular}
\end{center}
\end{table*}

\begin{table*}[h]
\caption{Classification tasks --- neural network training hyperparameters (grouped by dataset).}
\label{hyperparameters-table-classification}
\begin{center}
\begin{tabular}{c|c|c|c|c|c}
\hline 
\textbf{Dataset} & \textbf{Experiment} & \textbf{Optimizer} & \thead{\textbf{Learning} \\ \textbf{Rate} ($\boldsymbol{\alpha}$)} & \thead{\textbf{Weight Decay \textsuperscript{\ref{note1}}} \\ (MSE loss + $L_2$)} & \thead{\textbf{Shape Parameter} ($\boldsymbol{\psi}$) \\ (Mixup \& RLP + Mixup)} \\
\hline
Moons Dataset & No ablations & Adam & 0.001 & --- & 0.2 \\
Moons Dataset & \small $|J| = 25$ & Adam & 0.001 & --- & 0.4 \\
\hline
MNIST & No ablations & AdamW & 0.002 & --- & 0.2 \\
MNIST & \small $|J| = 100 $ & AdamW & 0.002 & --- & 0.2 \\
\hline
\end{tabular}
\end{center}
\end{table*}
\footnotetext{\label{note1} The default AdamW weight decay is set to $0.0001$ in all relevant experiments and is only changed for MSE loss + $L_2$ regularization.}


\end{document}
