%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
% version; also before submission to
% see how the non-anonymous paper
% would look like
%% There is a class option to choose the math font
%\documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
% ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{abbrvnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
    
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsfonts} %mathbb
\usepackage{amsmath}
\usepackage{cleveref}
\DeclareMathOperator{\R}{\mathbb R}
\usepackage{xstring}
% Subfigure and subcaption
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amssymb}
% Definition
\usepackage{amsthm}
\usepackage{bm}
% \theoremstyle{definition}
\theoremstyle{plain}
\newtheorem{metric}{Metric}[section]
\newtheorem{theorem}{Theorem}[section]
\newtheorem{corollary}{Corollary}[theorem]
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}[theorem]{Definition}


%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\newcommand{\RomanNumeralCaps}[1]
    {\MakeUppercase{\romannumeral #1}}
    
% Hard code appendix macro
\newcommand{\appref}[1]{%
    \IfEqCase{#1}{%
        {app:sec:walsh_hadamard}{A}%
        {app:sec:HashWH_details}{B}%
        {app:subsec:hash_sum_proof}{B.1}%
        {app:subsec:collision_probability}{B.2}%
        {app:subsec:EN-S}{B.3}%
        {app:sec:datasets}{C}%
        {app:sec:technical_details}{D}%
        {app:sec:ablation_details}{E}%
        {app:subsec:tree_fourier}{E.1}%
        {app:sec:extended_results}{F}%
        {app:subsec:evolution_detailed}{F.1}%
        {app:subsec:synthetic_detailed}{F.2}%
        {app:subsec:real_detailed}{F.3}%
        % you can add more cases
    }[\textbf{???}]%
}%

% Avoid hardcoding
% \newcommand{\appref}[1]{\ref{#1}}

% unnumbered footnote
\newcommand\nnfootnote[1]{%
  \begin{NoHyper}
  \renewcommand\thefootnote{}\footnote{#1}%
  \addtocounter{footnote}{-1}%
  \end{NoHyper}
}

% \setlength{\belowcaptionskip}{-2mm}
% \setlength{\abovecaptionskip}{0mm}

\title{A Scalable Walsh-Hadamard Regularizer to Overcome the \\Low-degree Spectral Bias of Neural Networks}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
% \author[1]{\href{mailto:<agorji@ethz.ch>?Subject=In regard to the paper 'Scalable regularization of Neural Networks in Fourier space'}{Ali Gorji}{}}
\author[1]{Ali Gorji$^*$}
\author[1]{Andisheh Amrollahi$^*$}
\author[1]{Andreas Krause}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Department\\
    ETH Zurich\\
    Zurich, Switzerland
}
  
  
\begin{document}
\maketitle


% \def\thefootnote{*}\footnotetext{These authors contributed equally to this work}\def\thefootnote{\arabic{footnote}}

\begin{abstract}
% \vspace{-4mm}
  % This is the abstract for this article.
  % It should give a self-contained single-paragraph summary of the article's contents, including context, results, and conclusions.
  % Avoid citations; but if you do, you must give essentially the whole reference.
  % For example: This whole paper is devoted to praising É. Š. Åland von Vèreweg's most recent book (“Utopia's government formation problems during the last millenium”, Springevier Publishers, 2016).
  % Also, do not put mathematical notation and abbreviations in your abstract; be descriptive.
  % So not “we solve \(x^2+A xy+y^2\), where \(A\) is an RV”, but “we solve quadratic equations in two unknowns in which a single coefficient is a random variable”.
  % The reason is that mathematical notation will not display correctly when the abstract is reused on the proceedings website, for example, and that one should not assume the abstract's reader knows the abbreviation.
  % Of course the same remarks hold for your paper's title.
\looseness -1 Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards ``simpler'' functions. Various notions of simplicity have been introduced to characterize this behavior. Here, we focus on the case of neural networks with discrete (zero-one), high-dimensional inputs through the lens of their Fourier (Walsh-Hadamard) transforms, where the notion of simplicity can be captured through the \emph{degree} of the Fourier coefficients. We empirically show that neural networks have a tendency to learn lower-degree frequencies.
We show how this spectral bias towards low-degree frequencies can in fact \emph{hurt} the neural network's generalization on real-world datasets. To remedy this, we propose a new scalable functional regularization scheme that helps the neural network learn higher-degree frequencies. Our regularizer also helps avoid erroneous identification of low-degree frequencies, which further improves generalization. We extensively evaluate our regularizer on synthetic datasets to gain insights into its behavior. Finally, we show significantly improved generalization on four different datasets compared to standard neural networks and other relevant baselines.
\end{abstract}
\nnfootnote{*These authors contributed equally to this work}


% \vspace{-5mm}
\section{Introduction}\label{sec:intro}
% \vspace{-3mm}
% First paragraph
% \begin{enumerate}
%     \item Importance of regularization
% \end{enumerate}


\looseness -1 Classical work on neural networks shows that deep fully connected neural networks have the capacity to approximate arbitrary functions \citep{hornik1989multilayer, cybenko1989approximation}. However, in practice, neural networks trained through (stochastic) gradient descent have a ``simplicity'' bias. This notion of simplicity is not agreed upon, and works such as \citet{arpit_closer_2017, nakkiran2019sgd, valle2018deep, kalimeris_sgd_2019} each introduce a different notion of ``simplicity''. The simplicity bias can also be studied by considering the function the neural net represents (function space view) and modeling it as a Gaussian process (GP) \citep{rasmussen2004gaussian}. \citet{daniely2016toward, lee2018deep} show that a wide, randomly initialized neural network in function space is a sample from a GP with a kernel called the ``Conjugate Kernel'' \citep{daniely2017sgd}. Moreover, the evolution of gradient descent on a randomly initialized neural network can be described through the ``Neural Tangent Kernel'' \citep{jacot_neural_2018, lee2019wide}. These works open up the road for analyzing the simplicity bias of neural nets in terms of a \emph{spectral} bias in Fourier space. \citet{rahaman_spectral_2019} show empirically that neural networks tend to learn sinusoids of lower frequencies earlier on in the training phase compared to those of higher frequencies. Through the GP perspective introduced by \citet{jacot_neural_2018, lee2019wide}, among others, \citet{ronen_convergence_2019, basri2020frequency} were able to prove these empirical findings. These results focus on \emph{continuous} domains and mainly emphasize the case where the input and output are both \emph{one-dimensional}.

Here, we focus on \emph{discrete} domains where the input is a \emph{high-dimensional} zero-one vector, and we quantitatively analyze the function learned by the neural network in terms of the amount of interaction among its input features. Our work is complementary to the majority of the aforementioned work on the spectral bias of neural networks, which considers the setting of \emph{continuous}, \emph{one-dimensional} inputs \citep{ronen_convergence_2019, basri2020frequency, rahaman_spectral_2019}. \citet{yang_fine-grained_2020, valle2018deep} are the first to provide spectral bias results for the discrete, higher-dimensional setting (our setting). By viewing a fully connected neural network as a function that maps zero-one vectors to real values, one can expand this function in terms of the Fourier (a.k.a.\ Walsh-Hadamard) basis functions. The Walsh-Hadamard basis functions have a natural ordering in terms of their complexity called their \emph{degree}. The degree specifies how many features each basis function depends on. For example, the zero-degree basis function is the constant function and the degree-one basis functions are functions that depend on exactly one feature. Through analysis of the Neural Tangent Kernel (NTK) Gram matrix on the Boolean cube, \citet{yang_fine-grained_2020} theoretically show that, roughly speaking, neural networks learn the lower-degree basis functions earlier in training.

% Here we extend their work by analyzing a neural network trained on summations of such functions i.e $k>1$-sparse functions and analyzing their behavior during training (over different epochs). We also checked for a variety of different dataset sizes. Our ultimate goal is to improve their generalization performance in learning these functions.  Therefore, we conduct a variety of experiments that shed insight into neural networks' degree-dependent, spectral behavior during training and its effect on overfitting and generalization. This guides us in proposing our novel regularizer \textsc{HashWH}. 




\looseness -1 This tendency to prioritize simpler functions in neural networks has been suggested as a cardinal reason for their remarkable generalization ability despite their over-parameterized nature \citep{neyshabur_exploring_2017, arpit_closer_2017, kalimeris_sgd_2019, poggio_theory_2018}. However, much less attention has been given to the case where the simplicity bias can {\em hurt} generalization \citep{tancik2020fourier, shah_pitfalls_2020}. 
%\citet{shah_pitfalls_2020} show that neural networks after training are functionally dependent on the ``simplest'' feature, even if it is less predictive than more complex features, in terms of for example classification accuracy. The features in their work is continuous and a ``simple'' feature is one that solves the problem of classification with single thresholding as opposed to several thresholds. 
\citet{tancik2020fourier} show how transforming the features with a random Fourier feature embedding helps the neural network overcome its spectral bias and achieve better performance in a variety of tasks. They were able to explain, in a unified way, many empirical findings in computer vision research such as sinusoidal positional embeddings through the lens of overcoming the spectral bias. In the same spirit as these works, we show that the spectral bias towards low-degree functions can hurt generalization, and we show how to remedy this through our proposed regularizer.



%\citep{yang_fine-grained_2020, ronen_convergence_2019, valle2018deep, neyshabur2015search, neyshabur_exploring_2017-1, soudry_implicit_2018, poggio_theory_2018, arpit_closer_2017, kalimeris_sgd_2019, rajeswaran_towards_2017, huh_low-rank_2022}


%were introduced to prevent overfitting to the training data or equivalently enhance generalizability. Weight decay or L2-regularization,
%\citep{krogh1991simple},
%L1-regularization, batch normalization,
% \citep{ioffe2015batch}, 
%and dropout
% \citep{srivastava2014dropout}
%are prominent examples that are widely adopted. 
In more recent lines of work, regularization schemes have been proposed to directly impose priors on the function the neural network represents \citep{benjamin_measuring_2019, sun_functional_2019, wang_function_2019}. This is in contrast to methods such as dropout, batch normalization, and others that regularize the weight space.
In this work, we also regularize neural networks in function space by imposing sparsity constraints on their Walsh-Hadamard transform. Closest to ours is the work of \citet{aghazadeh_epistatic_2021}. Inspired by studies showing that biological landscapes are sparse and contain high-degree frequencies \citep{sailer_detecting_2017, yang_higher-order_2019, brookes_sparsity_2022, ballal_sparse_2020, poelwijk_learning_2019}, they propose a functional regularizer to enforce sparsity in the Fourier domain and report improvements in generalization scores.
% , ,,, , , , }).
% * Worth rethinking candidates:
% -- pan_continual_2020: not the vanilla NN setup but the Continual Learning, an online learning setup in which tasks are presented sequentially. But literally regularizing the network in function space using GPs (https://proceedings.neurips.cc/paper/2020/file/2f3bbb9730639e9ea48f309d9a79ff01-Paper.pdf).
% -- cheng_control_2019: does the functional regularisation over three famous Deep RL frameworks (https://arxiv.org/pdf/1905.05380.pdf)
% -- piche_bridging_2022: seems to be really functional regularizing the value network in Deep RL, preprint though (https://arxiv.org/pdf/2210.12282.pdf).
% -- lyle_understanding_2022: functional regularization again in deep RL settings, "we proposed a regularizer to preserve capacity, yielding improved performance across a number of settings in which deep RL agents have historically struggled to match human performance.", (https://openreview.net/forum?id=ZkC8wKoLbQ7).

% * Probably irrelevant but still function space:
% -- titsias_functional_2020: "We introduced a functional regularisation approach for supervised continual learning that combines inducing point GP inference with deep neural networks.", good paper but not regularising Neural networks
% -- von_oswald_continual_2022: Continual learning setup + weight generator for NN
% -- li_fast_2019: despite directly working with Funtion space, defined over GNNs as well as being an operator than a regularizer
% rudner_tractable_2022: BNN, not regular NNs




\textbf{Our contributions:}
% \vspace{-2mm}
\begin{itemize}[leftmargin=*]
    \item We analyze the spectral behavior of a simple MLP during training through extensive experiments. We show that the standard (unregularized) network is not only unable to learn (more complex) high-degree frequencies, but also starts learning erroneous low-degree frequencies and hence overfits on this part of the spectrum.
    \item \looseness -1 We propose a novel regularizer, \textsc{HashWH} (Hashed Walsh Hadamard), to remedy the aforementioned phenomenon. The regularizer acts as a ``sparsifier'' on the Fourier (Walsh-Hadamard) basis. In the most extreme cases, it reduces to simply imposing an $L_1$-norm penalty on the Fourier transform of the neural network. Since computing the exact Fourier transform of the neural net is intractable, our regularizer hashes the Fourier coefficients to buckets and imposes an $L_1$-norm penalty on the buckets. By controlling the number of hash buckets, it offers a smooth trade-off between computational complexity and the quality of regularization.
    \item We empirically show that \textsc{HashWH} aids the neural network in avoiding erroneous low-degree frequencies and also learning relevant high-degree frequencies. %The regularizer guides gradient descent to train a function that has less energy in the lower-degree components and more energy in the high-degree components of its spectrum. 
    The regularizer guides the training procedure to allocate more energy to the high-frequency part of the spectrum when needed and allocate less energy to the lower frequencies when they are not present in the dataset.
    
    \item We show on real-world datasets that, contrary to the popular belief about simplicity biases in neural networks, fitting a low-degree function does not imply better generalization. Rather, what matters more is keeping the \emph{higher amplitude} coefficients regardless of their degree. We use our regularizer on four real-world datasets and provide state-of-the-art results in terms of $R^2$ score compared to standard neural networks and other baseline ML models, especially in the low-data regime. %We plot its performance for a variety of different training set sizes. 
\end{itemize}

% Neural networks exhibit an enormous capacity for approximating any arbitrary function, ranging from smoothly convex to highly non-convex ones 
% (maybe few citations).
% (bridge to regularization or capacity control stuff).
% To control the learning procedure,  which is often an iterative derivative-based optimization, researchers used various types of regularization techniques. Weight decay or $L2$-regularization, $L1$-regularization, Batch normalization, and Dropout, are prominent examples that are widely adopted in the community.
% In addition to these, many other available techniques are directly focusing on regularizing network parameters
% (\citet{wang_sparcl_2022, louizos_learning_2022, gu_towards_2020, sun_sparse-softmax_2021}). 
% This is understandable, given the inherent reduction of the usually intractable function space to the tractable network parameter space, done by Neural Networks.

% % Second paragraph:
% % \begin{enumerate}
% %     \item Function space analysis
% %     \item Typical weight space regularization
% %     \item Importances + instances of function space regularization
% % \end{enumerate}

% In a more general view, the ultimate objective is to find the best function that is not only able to map the input or \emph{training data} into the observed values, but also is able to accurately map unobserved inputs or \emph{test data} into the correct output. This is commonly referred to as "Generalization". regularization techniques suggested for Neural Networks are based on assumptions on how this generalization should be characterized and ergo effectively enforced. Unlike more frequent assumptions founded on network parameter space, Generalization can be also seen through the lens of the function space. In this perspective, Neural Networks are learning a function in the space of all functions mapping the input space to the output space, the quality of which can be assesed through the function space of desire.

% For instance, \citet{benjamin_measuring_2019} analyses the training procedure of Neural Networks using SGD in $L_2$ Hilbert space and shows that despite the diverging learning trajectories in the parameter space caused by multiple network initializations, they share the same start point and early-stage learning trajectories in the $L_2$ function distance space, which diverges in the later parameter updates. They propose a regularization technique to restrict the changes in the $L_2$ distance space, assuming that the locality of the updates in the distance space, is similar to the locality in the parameter space imposed by SGD. In a similar fashion, there are a handful of studies on analyzing Neural Networks in the function space (\citet{bai_geometric_2022, ha_adaptive_2021, ronen_convergence_2019, guiroy_towards_2019, sun_function-space_2022}, and regularization techniques based on function space perspectives (\citet{cheng_control_2019, von_oswald_continual_2022, pan_continual_2020,titsias_functional_2020, piche_bridging_2022, li_fast_2019, rudner_tractable_2022, lyle_understanding_2022, dong_how_2021}).

% The tendency to prioritize simpler functions, or so-called \emph{simplicity bias}, has been empirically observed and theoretically analyzed in Neural Networks, and is suggested as a cardinal reason for their remarkable generalization power despite their over-parameterized nature (\citet{yang_fine-grained_2020, ronen_convergence_2019, valle2018deep, neyshabur2015search, neyshabur_exploring_2017, soudry_implicit_2018, poggio_theory_2018, arpit_closer_2017, kalimeris_sgd_2019, rajeswaran_towards_2017, huh_low-rank_2022, khan_regularization_2019, li_functional_2021}). 
% Apart from many of these studies, "Simplicity" has been commonly taken as a surrogate for the generalizability of the solution
% (some extra citations).
% % In line with this finding, "Simplicity" has been taken as a surrogate for generalizability in many of regularization studies.
% Simplicity has been sought through a wide range of perspectives, among which different forms of sparsity induction received great attention (\citet{hou_spectral_2022, louizos_learning_2022, rajeswaran_towards_2017, li_smooth_2018, wang_sparcl_2022}).

% From the Fourier perspective, a natural notion of simplicity is \emph{low-complexity} and \emph{sparsity} in the frequency domain. 
% Among the two, recent studies demonstrate an available implicit bias for learning low-complexity frequencies earlier in the training of Neural Networks (\citet{rahaman_spectral_2019, ronen_convergence_2019, xu_understanding_2018, xu_training_2019, xu_frequency_2020}). 
% This phenomenon justifies Early Stopping as a mechanism of stopping the training before learning higher-complexity frequencies, which are often assumed to be unnecessary noises.
% Hence, imposing mere \emph{sparsity} in the frequency domain is the logical step towards simplicity in Fourier space. 
% Inspired by the spectral bias observed in many biological landscapes (\citet{sailer_detecting_2017, yang_higher-order_2019, brookes_sparsity_2022, ballal_sparse_2020, poelwijk_learning_2019}),
% \citet{aghazadeh_epistatic_2021} proposes regularization techniques to enforce sparsity in the Fourier domain and reports performance improvements in learning various biological fitness landscapes. This is fully aligned with our intuition of simplicity in Fourier space.

% In this work, we empirically analyze learning of set functions using Neural Networks through Fourier space, more specifically, Walsh-Hadamard basis. We propose a scalable regularizer to enforce sparsity in the Fourier frequency domain during training that is inspired by a common technique used in the Compress Sensing literature.
% (To be completed last)


% Third paragraph:
% \begin{enumerate}
%     \item Compress sensing makes it possible -> The downside is that it requires oracle access which is not possible in many scenarios
%     \item Fourier space analysis in NNs (+ necessary for low data regime)
%     \item Available Fourier space regularizations + shortcomings
% \end{enumerate}

% Fourth paragraph:
% \begin{enumerate}
%     \item Our contribution by two methods and their justifications
% \end{enumerate}

% \input{background}
% \vspace{-5mm}
\section{Background}\label{sec:background}
% \vspace{-3mm}
In this section, we first review Walsh Hadamard transforms, and notions of degree and sparsity in the Fourier (Walsh-Hadamard) domain \citep{o2014analysis}. Next, we review the notion of simplicity biases in neural networks and discuss why they are spectrally biased toward low-degree functions.
% \vspace{-7mm}
\subsection{Walsh Hadamard transforms}
\label{subsec:wht_background}
% \vspace{-3mm}
Let $g:\{0,1\}^n \rightarrow \R$ be a function mapping Boolean zero-one vectors to the real numbers, also known as a ``pseudo-boolean'' function. The family of $2^n$ functions $\{\Psi_f: \{0,1\}^n \rightarrow \R \mid f \in \{0,1\}^n\}$ defined below consists of the Fourier basis functions. This family forms a basis of the vector space of all pseudo-boolean functions:
\[
\Psi_f(x) = \frac{1}{\sqrt{2^n}} (-1)^{\langle f, x \rangle}, \quad f,x \in \{0,1\}^n,
\]
where ${\langle f, x \rangle} = \sum_i f_i x_i$. Here, $f \in \{0,1\}^n$ is called the \emph{frequency} of the basis function. For any frequency $f \in \{0,1\}^n$, we denote its \emph{degree} by $\text{deg}(f)$, which is defined as the number of non-zero elements of $f$. For example, $f_1=[0,0,0,0,0]$ and $f_2=[0,0,1,0,1]$ have degrees $\text{deg}(f_1)=0$ and $\text{deg}(f_2)=2$, respectively. One can think of the degree as a measure of the complexity of basis functions. For example, $\Psi_0(x)$ is constant, and $\Psi_{e_i}(x)$, where $e_i$ is a standard basis vector ($\text{deg}(e_i)=1$), only depends on feature $i$ of the input: up to the normalization factor $\frac{1}{\sqrt{2^n}}$, it is equal to $+1$ when feature $i$ is zero and equal to $-1$ when feature $i$ is one. More generally, a degree $d$ basis function depends on exactly $d$ input features.
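
As a small illustration (an example of ours with $n=2$, chosen only to make the definitions concrete), the four basis functions are
\[
\begin{aligned}
\Psi_{[0,0]}(x) &= \tfrac{1}{2}, &
\Psi_{[1,0]}(x) &= \tfrac{1}{2}(-1)^{x_1}, \\
\Psi_{[0,1]}(x) &= \tfrac{1}{2}(-1)^{x_2}, &
\Psi_{[1,1]}(x) &= \tfrac{1}{2}(-1)^{x_1+x_2},
\end{aligned}
\]
with degrees $0$, $1$, $1$, and $2$, respectively. These functions are orthonormal: for any $f, f' \in \{0,1\}^2$,
\[
\begin{aligned}
\sum_{x \in \{0,1\}^2} \Psi_f(x)\Psi_{f'}(x)
&= \tfrac{1}{4}\sum_{x \in \{0,1\}^2} (-1)^{\langle f + f', x \rangle} \\
&= \begin{cases} 1 & \text{if } f = f', \\ 0 & \text{otherwise,} \end{cases}
\end{aligned}
\]
since for $f \neq f'$ there is a coordinate $i$ with $f_i + f'_i = 1$, and summing over $x_i \in \{0,1\}$ cancels the terms in pairs.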

Since the Fourier basis functions form a basis for the vector space of all pseudo-boolean functions, any function $g:\{0,1\}^n \rightarrow \R $ can be written as a unique linear combination of these basis functions:
\[
g(x) =  \frac{1}{\sqrt{2^n}} \sum\limits_{f \in \{0,1\}^n}  \widehat{g}(f)  (-1)^{\langle f, x \rangle}
\]
The (unique) coefficients $\widehat{g}(f)$ are called the ``Fourier coefficients'' or ``Fourier amplitudes'' and are computed as $
\widehat{g}(f) =  \frac{1}{\sqrt{2^n}} \sum\limits_{x \in \{0,1\}^n}  g(x)  (-1)^{\langle f, x \rangle}
$. 
The \emph{Fourier spectrum} of $g$ is the vector consisting of all of its $2^n$ Fourier coefficients, which we denote by the bold symbol $\widehat{\mathbf{g}}\in\mathbb{R}^{2^n}$. Let $\mathbf{X}\in \{0,1\}^{2^n\times n}$ be the matrix whose rows enumerate all $n$-dimensional binary vectors in $\{0,1\}^n$, and let $\mathbf{g}(\mathbf{X})\in \mathbb{R}^{2^n}$ be the vector of $g$ evaluated on the rows of $\mathbf{X}$. We can compute the Fourier spectrum using the Walsh-Hadamard transform as $
\widehat{\mathbf{g}} = \frac{1}{\sqrt{2^n}}\mathbf{H}_n \mathbf{g}(\mathbf{X})$, 
where $\mathbf{H}_n \in \{\pm1\}^{2^n\times 2^n}$ is the Hadamard matrix, whose rows are mutually orthogonal (see Appendix~\appref{app:sec:walsh_hadamard}).
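
As a sanity check of these formulas, consider a toy example of ours (not part of the experiments): $n=2$ and $g(x) = x_1 x_2$. Enumerating the inputs and the frequencies in the order $[0,0],[0,1],[1,0],[1,1]$ gives $\mathbf{g}(\mathbf{X}) = [0,0,0,1]^\top$ and
\[
\widehat{\mathbf{g}} = \frac{1}{2}
\begin{pmatrix} 1&1&1&1\\ 1&-1&1&-1\\ 1&1&-1&-1\\ 1&-1&-1&1 \end{pmatrix}
\begin{pmatrix} 0\\0\\0\\1 \end{pmatrix}
= \frac{1}{2}\begin{pmatrix} 1\\-1\\-1\\1 \end{pmatrix},
\]
where the matrix is $\mathbf{H}_2$, whose entry in row $f$ and column $x$ is $(-1)^{\langle f, x\rangle}$. Substituting these coefficients back into the Fourier expansion recovers $g$:
\[
\begin{aligned}
g(x) &= \tfrac{1}{4}\bigl(1 - (-1)^{x_2} - (-1)^{x_1} + (-1)^{x_1+x_2}\bigr) \\
&= \tfrac{1}{4}\bigl(1-(-1)^{x_1}\bigr)\bigl(1-(-1)^{x_2}\bigr) = x_1 x_2 .
\end{aligned}
\]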

Lastly, we define the \emph{support} of $g$ as the set of frequencies with non-zero Fourier amplitudes, $\text{supp}(g) := \{f \in \{0,1\}^n \mid \widehat{g}(f) \neq 0\}$. The function $g$ is called \emph{$k$-sparse} if $|\text{supp}(g)| \leq k $. The function $g$ is called \emph{of degree} $d$ if all frequencies in its support have degree at most $d$. 
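
To see that sparsity and degree are distinct notions, consider two toy functions on $\{0,1\}^2$ (illustrative examples of ours). The parity function $g_1(x) = (-1)^{x_1+x_2}$ satisfies
\[
\begin{aligned}
\widehat{g_1}(f) &= \frac{1}{2}\sum_{x\in\{0,1\}^2} (-1)^{x_1+x_2}(-1)^{\langle f,x\rangle} \\
&= \begin{cases} 2 & \text{if } f=[1,1], \\ 0 & \text{otherwise,} \end{cases}
\end{aligned}
\]
so $\text{supp}(g_1)=\{[1,1]\}$: it is $1$-sparse and of degree $2$. In contrast, $g_2(x) = (-1)^{x_1}+(-1)^{x_2}$ has $\widehat{g_2}([1,0])=\widehat{g_2}([0,1])=2$ and all other coefficients equal to zero, so $\text{supp}(g_2)=\{[1,0],[0,1]\}$: it is $2$-sparse and of degree $1$.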
% \vspace{-5mm}
\subsection{Spectral bias theory} 
% \vspace{-3mm}
The function that a ReLU neural network represents at initialization can be seen as a sample from a GP $\mathcal{N}(0,K)$ in the infinite-width limit \citep{daniely2016toward, lee2018deep} (randomness is over the initialization of the weights and biases). The kernel $K$ of the GP is called the ``Conjugate Kernel'' \citep{daniely2016toward} or the ``nn-GP kernel'' \citep{lee2018deep}. Let the kernel Gram matrix $\mathcal{K}$ be formed by evaluating the kernel on the Boolean cube, i.e., $\{0,1\}^n$, and let $\mathcal{K}$ have the following spectral decomposition: $\mathcal{K} = \sum\limits_{i=1}^{2^n} \lambda_i u_i u_i^\top$, where we assume that the eigenvalues $\lambda_1 \geq \dots \geq \lambda_{2^n}$ are in decreasing order. 
Each sample of the GP, evaluated on the Boolean cube, can be obtained as $
\sum\limits_{i=1}^{2^n} \sqrt{\lambda_i}\, \bm{w_i} u_i$, with independent $\bm{w_i} \sim \mathcal{N}(0,1)$. 
Suppose that $\lambda_1 \gg \sum_{i \geq 2} \lambda_i$. Then a sample from the GP will, roughly speaking, look very much like $u_1$ (up to scaling). 
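
As a brief check of this sampling representation, under the standard assumptions that the eigenvectors $u_i$ are orthonormal and the $\bm{w_i}$ are independent standard normal variables, a vector of the form $s = \sum_{i} \sqrt{\lambda_i}\, \bm{w_i} u_i$ has mean zero and covariance
\[
\begin{aligned}
\mathbb{E}\bigl[s s^\top\bigr]
&= \sum_{i,j} \sqrt{\lambda_i \lambda_j}\; \mathbb{E}[\bm{w_i} \bm{w_j}]\, u_i u_j^\top \\
&= \sum_{i} \lambda_i u_i u_i^\top = \mathcal{K},
\end{aligned}
\]
so that $s \sim \mathcal{N}(0, \mathcal{K})$. In particular, the expected squared coefficient of $s$ along $u_i$ is $\mathbb{E}[\langle s, u_i \rangle^2] = \lambda_i$, which is why a dominant $\lambda_1$ concentrates samples along $u_1$.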

Let $u_f, f\in\{0,1\}^n$ be obtained by evaluating the Fourier basis function $\Psi_f$ at the $2^n$ possible inputs on $\{0,1\}^n$. \citet{yang_fine-grained_2020} show that $u_f$ is an eigenvector of $\mathcal{K}$. Moreover, they show (weak) spectral bias results in terms of the degree of $f$. Namely, the eigenvalues corresponding to higher degrees have smaller values.\footnote{To be more precise, they show that the eigenvalues corresponding to even and odd degree frequencies form decreasing sequences. That is, even and odd degrees are considered separately.} The result is \emph{weak} as they do not provide a \emph{rate} at which the eigenvalues decrease with increasing degrees. Their results show that neural networks are similar to low-degree functions at initialization. 

Other works show that in infinite-width neural networks, the weights after training via (stochastic) gradient descent do not end up too far from the initialization \citep{chizat2019lazy, jacot_neural_2018, du2018gradient, allen2019convergence, allen2019convergencernn}, a phenomenon referred to as ``lazy training'' by \citet{chizat2019lazy}. \citet{lee2018deep, lee2019wide} show that training the last layer of a randomly initialized neural network via full-batch gradient descent for an infinite amount of time corresponds to GP posterior inference with the kernel $K$. \citet{jacot_neural_2018, lee2019wide} proved that when training \emph{all} the layers of a neural network (not just the final layer), the evolution can be described by a kernel called the ``Neural Tangent Kernel'' (NTK), and the trained network yields the mean prediction of the GP $\mathcal{N}(0, K_{NTK})$ \citep{yang_fine-grained_2020} after an infinite amount of time. \citet{yang_fine-grained_2020} again show that the $u_f$ are eigenvectors and that weak spectral bias holds. Furthermore, \citet{yang_fine-grained_2020} provide empirical results for the generalization of neural nets of different depths on datasets arising from $1$-sparse functions (i.e., $k=1$) of varying degrees. 









%the eigenvalue decomposition of the kernel matrices $\mathcal{K}$ to indeed show the eigenvalues are strictly decreasing with an increasing degree. However, they do not take one step further to compute the Fourier transforms of any trained neural networks, which we show in our results. We do this because it is not obvious that in the non-infinite width setting, the NTK results hold, even though \citet{lee2019wide} does extensive experiments to show NTK holds for finite widths. 


% \paragraph{Compress Sensing.}
% * The relation to Compress Sensing in focusing on viewing the function from Fourier perspective
% * The importance of our method against compress sensing:
% 1) Oracle access
% 2) Tractability for higher-degree frequencies


% \paragraph{Simplicity bias.}

% * What has been observed in terms of simple functions first -> lower frequencies first
% * The downside of simplicity bias in not letting the model learn important higher frequencies
% * The importance of sparsity induction in learning a generalizable function

% \subsection{Sparsifying Fourier Spectrum of Neural Networks}
% The practice of sparsifying the Fourier spectrum of Neural Networks has been previously tested by \citet{aghazadeh_epistatic_2021}. They proposed two regularization techniques to do so. The first is simply penalizing the $L1$ norm of the exact Fourier spectrum of the network at each optimization iteration, named EN, which we discuss in section \ref{sec:en}. Given the impracticality of EN in larger spaces, they propose a scalable method by 

% * What's observed in biology:
% 1) many biological landscapes are sparse
% 2) there are important high-complexity relations to well-approximate the landscapes
% * Aghazadeh methods to fulfill this
% (* other Aghazadeh works on the sparsity dynamics and if possible low-data regime)


% \input{method}
% \vspace{-6mm}
\section{Low-degree spectral bias}\label{sec:method}
% \vspace{-3mm}
%For better readability, in this section, we refer to different regularization techniques by their name, and the Neural Network without any further regularizations as \emph{standard}.

In this section, we conduct experiments on synthetically generated datasets to show neural networks' spectral bias and their preference toward learning lower-degree functions over higher-degree ones. Firstly, we show that the neural network is not able to pick up the high-degree frequency components. Secondly, we show that it can learn erroneous lower-degree frequency components. To address these issues, in Section~\ref{sec:en}, we introduce our regularization scheme called \emph{\textsc{HashWH}} (Hashed Walsh Hadamard) and demonstrate how it can remedy both problems.



% The experiments are conducted in two different settings. One in a relatively low-dimension space to be able to finely monitor the behavior of different regularization schemes, and one in higher dimensions to highlight the power of our regularization method in real-world problems both in terms of computational complexity and also sparsifying the spectrum. 
\subsection{Fourier spectrum evolution} 
\begin{figure*}[!htb]
    \centering
    \begin{subfigure}[b]{0.75\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/spectrum/data_freq_heatmap_n10_d5_size2.pdf}
         \caption{Target support}
         \label{fig:data_freq_heatmap}
     \end{subfigure}

     \begin{subfigure}[b]{0.9\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/spectrum/sc_heatmap_n10_d5_size2_seed5.png}
         \caption{Whole Fourier spectrum}
         \label{fig:sc_heatmap}
     \end{subfigure}
    \caption{Evolution of the Fourier spectrum during training. \textsc{Standard} is the unregularized neural network. \textsc{FullWH} imposes $L_1$-norm regularization on the exact Fourier spectrum and is intractable. \textsc{EN-S} alternates between computing a sparse Fourier approximation (computationally very expensive) and regularization. \textsc{HashWH} (ours) imposes $L_1$ regularization on the hashed spectrum. Panel (a) is limited to the target support: the standard neural network is unable to learn higher-degree frequencies, which our regularizer fixes. Panel (b) shows the whole spectrum: the standard neural network picks up erroneous low-degree frequencies while failing to learn the higher-degree ones. Our regularizer fixes both problems.}
\end{figure*}

We analyze the evolution of the function learned by neural networks during training. We train a neural network on a dataset generated from a synthetic sparse function with a low-dimensional input domain. Because the input is low-dimensional, we can compute the Fourier spectrum of the network exactly at the end of each epoch.
% We compare the evolution of the Fourier spectrum of Neural Networks trained with and without sparsity regularization. Our regularizations of interest are EN, EN-S with $m=4$, and \textsc{HashWH} with $b\in\{5, 7, 8\}$. For each model, we compute the output of the network on all $2^{10}$ possible inputs after each training epoch using which we compute the Fourier spectrum of the network.

\textbf{Setup.} 
Let $g^*:\{0, 1\}^{10} \rightarrow \mathbb{R}$ be a synthetic function with five frequencies in its support with degrees 1 to 5 ($\text{supp}(g^*)=\{f_1, f_2, f_3, f_4, f_5\}, \text{deg}(f_i)=i$), all having equal Fourier amplitudes of $\widehat{g}^*(f_i)=1$. Each $f_i$ is sampled uniformly at random from all possible frequencies of degree $i$. The training set is formed by drawing uniform samples from the Boolean cube $x\sim \mathcal{U}_{\{0,1\}^{10}}$ and evaluating $g^*(x)$.
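
For concreteness, and assuming (as is standard) that the degree of a frequency $f \in \{0,1\}^{10}$ is its number of nonzero entries, the number of candidate frequencies of degree $i$ from which $f_i$ is drawn is $\binom{10}{i}$:
\[
    \binom{10}{1}=10,\ \ \binom{10}{2}=45,\ \ \binom{10}{3}=120,\ \ \binom{10}{4}=210,\ \ \binom{10}{5}=252.
\]
The five target frequencies are thus hidden among the $2^{10}=1024$ coefficients of the full spectrum.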

% , i.e., random initialization of the network, random samples used in EN-S for sparse Fourier spectrum approximation, and sampling of hashing matrices in our method

We draw five such target functions $g^*$ (with random support frequencies). For each draw of the target function, we create five different datasets all with 200 training points and sampled uniformly from the input domain but with different random seeds. We then train a standard five-layer fully connected neural network using five different random seeds for the randomness in the training procedure (such as initialization weights and SGD). We aggregate the results over the $125$ experiments by averaging.
%to reduce the variance resulting from the randomness in the choice of $\text{supp}(g^*)$, training data, and training randomness. 
We repeat the same experiment with three other training set sizes.
Results for training set sizes other than 200 and further setup details are reported in Appendices~\appref{app:subsec:evolution_detailed} and \appref{app:sec:technical_details}, respectively. % In all that follows, we refer to the $g^*$ used to generate each dataset as \emph{target} function, and the frequencies in its support as the target support.


\textbf{Results.} 
We first inspect the evolution of the learned Fourier spectrum over different epochs and limited to the target support ($\text{supp}(g^*)$). Figure~\ref{fig:data_freq_heatmap} shows the learned amplitudes for frequencies in the target support at each training epoch.
%averaged over the 125 aforementioned runs.
% (made possible due to always having a single frequency for each degree).
In line with the literature on simplicity bias \citep{valle2018deep,yang_fine-grained_2020}, we observe that neural networks learn the low-degree frequencies in earlier epochs. Moreover, the left-most panel of Figure~\ref{fig:data_freq_heatmap} shows that, although the standard network eventually learns the low-degree frequencies, it is unable to learn the high-degree ones.
% However, all sparsity-inducing regularization methods display better performance in learning high-degree frequencies. To the extent that EN is capable of perfectly learning all target frequencies. It can also be seen that increasing the size of the hashing matrix in \textsc{HashWH} boosts the learning of high-degree frequencies.

Next, we expand the investigation to the whole Fourier spectrum instead of focusing only on the support frequencies. The first row of Figure~\ref{fig:sc_heatmap} shows the evolution of the Fourier spectrum during training and compares it to the spectrum of the target function in the bottom row. We show the spectrum for one of the five synthetic target functions, averaged over the randomness of the dataset sampling and training procedure, and report the other four in Appendix~\appref{app:subsec:evolution_detailed}. We observe that, in addition to being unable to learn the high-degree frequencies, the standard network is prone to learning incorrect low-degree frequencies.
%This is another artifact of the simplicity bias. 

% It can be seen that besides the better performance of the sparsity-inducing methods in learning the target frequencies, they are also better at filtering non-relevant frequencies. The standard model, however, has wrongly learned multiple low-degree frequencies over the course of training.

% \vspace{-3mm}
% \section{Sparsifying the Fourier spectrum} 
\section{Overcoming the spectral bias via regularization}
\label{sec:en}
% \vspace{-2mm}
Now, we introduce our regularization scheme \emph{\textsc{HashWH}} (Hashed Walsh-Hadamard). 
Our regularizer is essentially a ``sparsifier'' in the Fourier domain. That is, it guides the neural network to have a sparse Fourier spectrum. We empirically show later how sparsifying the Fourier spectrum can both stop the network from learning erroneous low-degree frequencies and aid it in learning the higher-degree ones, hence remedying the two aforementioned problems.

Assume $\mathcal{L}_{net}$ is the loss function that a standard neural network minimizes, e.g., the MSE loss in the above case.  We modify it by adding a regularization term $\lambda \mathcal{L}_{sparsity}$. Hence the total loss is given by: $\mathcal{L} = \mathcal{L}_{net} + \lambda \mathcal{L}_{sparsity}$.
% The sparsity loss $\mathcal{L}_{sparsity}$ should be lower when the Fourier spectrum of the network, $\widehat{g_\theta}$, is sparser. 

The most intuitive choice is $\mathcal{L}_{sparsity}=\|\widehat{\mathbf{g_N}}\|_0$, where $\widehat{\mathbf{g_N}}$ is the Fourier spectrum of the neural network function $g_N: \{0,1\}^n \rightarrow \R$. Since the derivative of the $L_0$-penalty is zero almost everywhere, one can use its tightest convex relaxation, the $L_1$-norm, which is also sparsity-inducing, as a surrogate loss. \citet{aghazadeh_epistatic_2021} use this idea and call it Epistatic-Net or ``EN'' regularization:
\begin{equation}
    \mathcal{L}_{EN} := \mathcal{L}_{net} + \lambda \|\widehat{\mathbf{g_N}}\|_1. \label{eq:EN_loss}
\end{equation}
In this work, we call this regularization \textsc{FullWH} (Full Walsh-Hadamard transform).
% EN requires the computation of the network output on all $2^n$ possible inputs at each iteration of the back-propagation. Therefore, the computational complexity grows \emph{exponentially} with the number of dimensions $n$, making it computationally intractable for $n>20$. To avoid the burden of computing network output on all possible inputs, we employ a hashing technique to approximate the $\|\widehat{\mathbf{g_\theta}}\|_1$ term.

\textsc{FullWH} requires evaluating the network output on all $2^n$ possible inputs at each iteration of back-prop. Its computational cost therefore grows \emph{exponentially} with the number of dimensions $n$, making it intractable for $n>20$ in all settings of practical importance.

\looseness -1 \citet{aghazadeh_epistatic_2021} also suggest a more scalable version of \textsc{FullWH}, called ``\textsc{EN-S}'',
%by considering  $\widehat{\mathbf{g_\theta}}$ as an explicit optimization constraint and decoupling the optimization of $\mathcal{L}_{EN}$ into two separate minimization problems of network optimization and Fourier spectrum sparsification followed by a dual update, using ADMM (\citet{boyd_distributed_2011}), which enabled them to iteratively approximate a sparse Fourier spectrum for the network at each epoch. 
which, roughly speaking, alternates between computing a sparse \emph{approximate} Fourier transform of the network at the end of each epoch and doing normal back-prop, as opposed to computing the exact Fourier spectrum when back-propagating the gradients.
In our experiments, we show that \textsc{EN-S} can be computationally expensive because the sparse Fourier approximation primitive is time-consuming. For a comprehensive comparison, see Appendix~\appref{app:subsec:EN-S}. Later, we show empirically that it is also less effective in overcoming the spectral bias, as measured by the achievable final generalization error.

%where $m$ is  the hash size of the Fourier approximation algorithm used.


% \vspace{-5mm}
\subsection{\textsc{HashWH}}
\label{subsec:Hashwh}
% \vspace{-3mm}
We avoid the exponential cost of computing the exact Fourier spectrum of the network by employing a hashing technique to approximate the regularization term $\lambda \|\widehat{\mathbf{g_N}}\|_1$.
Let $g:\{0, 1\}^n \rightarrow \mathbb{R}$ be a pseudo-boolean function. We define the lower-dimensional function $u_{\mathbf{\sigma}}: \{0, 1\}^b \rightarrow \mathbb{R}$, where $b \ll n$, by sub-sampling $g$ on its domain: $u_\mathbf{\sigma}(\tilde{x}) \triangleq \sqrt{\frac{2^n}{2^b}} \ g(\mathbf{\sigma} \tilde{x}), \ \tilde{x} \in \{0,1\}^b$,
where $\mathbf{\sigma} \in \{0,1\}^{n \times b}$ is a matrix which we call the \emph{hashing matrix}. The matrix-vector product $\sigma \tilde{x}$ is taken modulo 2. In other words, $u_\sigma$ is obtained by sub-sampling $g$ on all points of the (at most) $b$-dimensional subspace spanned by the columns of the hashing matrix $\sigma$. The special property of sub-sampling on this subspace lies in the resulting Fourier transform of $u_\sigma$, which we explain next.

The Fourier transform of $u_{\mathbf{\sigma}}$ can be derived as (see Appendix~\appref{app:subsec:hash_sum_proof}):
\begin{align}
    \widehat{u}_\mathbf{\sigma}(\Tilde{f}) = \sum_{f \in \{0,1\}^n:\ \mathbf{\sigma}^\top f= \Tilde{f}} \widehat{g}(f), \ \Tilde{f} \in \{0,1\}^b
    \label{eq:hash_sum}
\end{align}
% \todo{Prove it in Appendix}
One can view $\widehat{u}_\mathbf{\sigma}(\Tilde{f})$ as a ``bucket'' containing the sum of all Fourier coefficients $\widehat{g}(f)$ that are ``hashed'' (mapped) into it by the linear hashing function $h(f) = \sigma^\top f$. There are $2^b$ such buckets, and each bucket contains the frequencies lying in the kernel (null space) of the hashing map plus some shift.
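
As a toy instance of \eqref{eq:hash_sum} (the matrix below is chosen by hand for exposition and is not taken from the experiments), let $n=3$, $b=2$, and write $f=(\phi_1,\phi_2,\phi_3)$ for the entries of a frequency. With
\[
    \sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
    h(f) = \sigma^\top f = (\phi_1 + \phi_3,\ \phi_2 + \phi_3) \bmod 2,
\]
the kernel of $h$ is $\{000, 111\}$ and the eight frequencies are partitioned into $2^b=4$ buckets:
\begin{align*}
    \widehat{u}_\sigma(00) &= \widehat{g}(000) + \widehat{g}(111), \\
    \widehat{u}_\sigma(10) &= \widehat{g}(100) + \widehat{g}(011), \\
    \widehat{u}_\sigma(01) &= \widehat{g}(010) + \widehat{g}(101), \\
    \widehat{u}_\sigma(11) &= \widehat{g}(110) + \widehat{g}(001).
\end{align*}
Each bucket is the kernel shifted by one representative frequency, e.g., $\{100, 011\} = 100 + \{000, 111\}$ (addition modulo 2).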
% , the Fourier transform of which on an arbitrary bucket is the sum of original Fourier coefficients of frequencies hashed into it. 

In practice, we let $\mathbf{\sigma} \sim \mathcal{U}_{\{0,1\}^{n \times b}}$ be a uniformly sampled hashing matrix that is re-sampled after each iteration of back-prop. Let $\mathbf{X}_b \in \{0, 1\}^{2^b \times b}$ be the matrix whose rows enumerate all points of the Boolean cube $\{0, 1\}^b$. Our regularized loss approximates \eqref{eq:EN_loss} and is given by:
\begin{equation}
    \mathcal{L}_{\textsc{HashWH}} 
    \triangleq \mathcal{L}_{net} + \lambda \| \mathbf{H}_b \mathbf{g_N}(\mathbf{X}_b \mathbf{\sigma}^\top) \|_1
    = \mathcal{L}_{net} + \lambda \|\widehat{\mathbf{u_\mathbf{\sigma}}}\|_1 \nonumber 
    \label{eq:HashWH_loss}
\end{equation}
\looseness -1 That is, instead of imposing the $L_1$-norm directly on the whole spectrum, this procedure imposes the norm on the ``bucketed'' (or partitioned) spectrum, where each bucket (partition) contains the sum of the coefficients mapped to it. The larger $b$ is, the more partitions we have and the finer-grained the sparsity-inducing procedure is. Therefore, the quality of the approximation can be controlled by the choice of $b$. A larger $b$ allows for finer-grained regularization but comes at a higher computational cost, because a Walsh-Hadamard transform must be computed for a higher-dimensional sub-sampled function $u_\sigma$. Note that $b=n$ corresponds to hashing into $2^n$ buckets. As long as the hashing matrix is invertible, this is precisely the case of \textsc{FullWH} regularization.
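
To give a rough sense of scale (a back-of-the-envelope count, assuming the Walsh-Hadamard transform is computed with the standard fast algorithm in $O(b\,2^b)$ operations), each evaluation of the hashed regularizer requires the network output on $2^b$ inputs instead of $2^n$. For instance, for $n=50$ and $b=8$:
\[
    2^{n} = 2^{50} \approx 1.1 \times 10^{15}, \qquad 2^{b} = 256, \qquad b\,2^{b} = 2048.
\]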

A problem with the above procedure arises when two ``important'' frequencies $f_1$ and $f_2$ are hashed into the same bucket, i.e., $\mathbf{\sigma}^\top f_1 = \mathbf{\sigma}^\top f_2$, an event which we call a ``collision''. This is problematic when the absolute values $|\widehat{g}(f_1)|$ and $|\widehat{g}(f_2)|$ are large (hence the frequencies are important) but the coefficients have opposite signs, so that their sum cancels out. In this case, the bucket sum can be zero even though both coefficients are large. We can reduce the probability of a collision by increasing the number of buckets, i.e., increasing $b$ \citep{alon1999linear}.
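
A minimal illustration of such a cancellation, with hypothetical coefficient values: suppose $f_1$ and $f_2$ are the only frequencies with nonzero coefficients hashed into a bucket $\tilde{f}$, with $\widehat{g}(f_1)=1$ and $\widehat{g}(f_2)=-1$. Then, by \eqref{eq:hash_sum},
\[
    |\widehat{u}_\sigma(\tilde{f})| = |1 + (-1)| = 0, \quad \text{whereas} \quad |\widehat{g}(f_1)| + |\widehat{g}(f_2)| = 2.
\]
More generally, since every frequency falls into exactly one bucket, the triangle inequality gives
\[
    \|\widehat{\mathbf{u_\sigma}}\|_1
    = \sum_{\tilde{f} \in \{0,1\}^b} \Big| \sum_{f:\ \sigma^\top f = \tilde{f}} \widehat{g}(f) \Big|
    \leq \sum_{f \in \{0,1\}^n} |\widehat{g}(f)|
    = \|\widehat{\mathbf{g}}\|_1,
\]
with equality when no bucket contains coefficients of opposite signs.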

In Appendix~\appref{app:subsec:collision_probability} we show that, for $k$ important frequencies, the expected number of collisions $C$ is given by
$\mathbb{E}[C]=\frac{(k-1)^2}{2^{b}}$, which is inversely proportional to the number of buckets $2^b$. Furthermore, we can upper bound the probability $p$ that a given important frequency $f_i$ collides with any of the other $k-1$ important frequencies in one round of hashing. Since we independently sample a new hashing matrix $\mathbf{\sigma}$ at each round of back-prop, the number of collisions of a given frequency over $T$ rounds follows a binomial distribution. In Appendix~\appref{app:subsec:collision_probability} we show that picking $b \geq \log_2(\frac{k-1}{\epsilon})$, $\epsilon>0$, guarantees that a given frequency collides in approximately an $\epsilon$-fraction of the $T$ rounds, and not much more.
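
One way to see where the condition on $b$ comes from is the following sketch, which relies only on the uniform sampling of $\sigma$ (the formal statement is the one in the appendix). For two distinct frequencies $f_i \neq f_j$, the vector $\sigma^\top (f_i + f_j) \bmod 2$ is uniformly distributed over $\{0,1\}^b$, so the two collide with probability $2^{-b}$. A union bound over the other $k-1$ important frequencies then gives
\[
    p \leq \frac{k-1}{2^b} \leq \epsilon \quad \text{whenever} \quad b \geq \log_2\Big(\frac{k-1}{\epsilon}\Big).
\]
For example, with $k=5$ and $\epsilon=0.1$ we need $b \geq \log_2 40 \approx 5.32$, i.e., $b=6$, which gives $p \leq 4/64 \approx 0.06$.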



% The total number of collisions $C$ can be bounded if $g$ is $k$-sparse and the matrix $\mathbf{\sigma}$ is chosen uniformly at random (see section \ref{sup:sec:\textsc{HashWH}_details} in the Appendix for the proof):
% they are hashed into the same bucket  is $k$-sparse and the matrix $\mathbf{\sigma}$ is chosen uniformly at random (see Appendix \ref{sup:sec:\textsc{HashWH}_details} for the proof):
% \[
% \mathbb{E}(C)\leq \frac{k^2}{2^b}
% \]
% For each bucket $f_u \in \{0, 1\}^b$ with only one frequency $f \in \text{supp}(g_\theta)$ hashed into, $\widehat{u}_\mathbf{\sigma}(f_u)=\widehat{g_\theta}(f)$. Therefore, we can approximate the non-zero part of $\widehat{\mathbf{g_\theta}}$ using $\widehat{\mathbf{u_\mathbf{\sigma}}}$, where the lower number of collisions leads to a better approximation. In case of no collisions, $\|\widehat{\mathbf{u_\mathbf{\sigma}}}\|_1=\|\widehat{\mathbf{g_\theta}}\|_1$, which has the probability of $P(C=0)\geq 1-\frac{k^2}{2^b}$ (Markov's inequality).

% Therefore, lower number of collisions leads to better approximation of the non-zero part of $\widehat{\mathbf{g}}$ using $\widehat{\mathbf{u_\mathbf{\sigma}}}$, where $\|\widehat{\mathbf{u_\mathbf{\sigma}}}\|_1=\|\widehat{\mathbf{g}}\|_1$ with the probability of $P(C=0)\geq 1-\frac{k^2}{2^b}$ (Markov's inequality).

% It can be shown that if $g$ is $k$-sparse and the matrix $\mathbf{\sigma}$ is chosen uniformly, the probability of hashing more than two frequencies of support $g$ into one bucket can get arbitrarily small by increasing the hashing parameter $b$.

% If the hashing matrix $\mathbf{\sigma}$ hashes each frequency in $supp(g)$ into a unique bucket, i.e. there is no ``collision'' in the hashing of $supp(g)$, the non-zero Fourier spectrum of $u_\mathbf{\sigma}$ and $g$ would be similar.
% It can be shown that 


%We sample $\mathbf{\sigma}$ at each iteration to ensure a statically robust mechanism. With \textsc{HashWH}, we scale the number of samples needed to calculate the regularization term at each optimization iteration to $2^b$, compared to the $2^n$ samples needed for EN. This lets our method scale given the desired approximation guarantee and available resources.

% With this approximation, we reduce the number of samples needed to calculate the regularization term to $2^b$, compared to the $2^n$ samples needed to compute the Fourier spectrum in EN. Furthermore, the size of the hashing space can be scaled given the problem and available resources.

% * Write something about EN-S and compute an approximation for b based on EN-S equivalent ($2^m*d*3*n \rightarrow b \geq m + 3 + logn$)

\textbf{Fourier spectrum evolution of different regularization methods.} 
We analyze the effect of regularizing the network with various Fourier sparsity regularizers in the setting of the previous section. 
%monitoring the evolution of the Fourier spectrum using the same datasets and 
%random seeding strategy. 
Our regularizers of interest are \textsc{FullWH}, \textsc{EN-S} with $m=5$ ($2^m$ is the number of buckets their sparse Fourier approximation algorithm hashes into), and \textsc{HashWH} with $b \in \{5, 7, 8\}$.  

Returning to Figure~\ref{fig:data_freq_heatmap}, we see that, although the standard neural network is unable to pick up the \emph{high-degree} frequencies, all sparsity-inducing regularization methods are able to learn them. \textsc{FullWH} is capable of perfectly learning the entire target support. It can also be seen that increasing the size of the hashing matrix in \textsc{HashWH} (ours) boosts the learning of high-degree frequencies. Furthermore, Figure~\ref{fig:sc_heatmap} shows that, in addition to learning the target support better, the sparsity-inducing methods are also better at filtering out non-relevant \emph{low-degree} frequencies.
% This suggests that enforcing a sparse Fourier spectrum during training can help the network learn relevant high-degree frequencies.

We define a notion of approximation error, namely the normalized energy of the error in the learned Fourier spectrum on an arbitrary subset of frequencies.
% \vspace{-1mm}
\begin{metric}[Spectral Approximation Error (SAE)]
Let $g_N: \{0, 1\}^n\rightarrow \mathbb{R}$ be an approximation of the target function $g^*: \{0, 1\}^n\rightarrow \mathbb{R}$. Consider a subset of frequencies $S \subseteq \{0,1\}^n$, and let $\widehat{\mathbf{g_N}}_S$ and $\widehat{\mathbf{g^*}}_S$ be the vectors of Fourier coefficients of the frequencies in $S$, for $g_N$ and $g^*$ respectively. As a measure of the distance between $g_N$ and $g^*$ on the subset of frequencies $S$, we define the Spectral Approximation Error as: $
    \text{SAE} = \frac{\|\widehat{\mathbf{g_N}}_S - \widehat{\mathbf{g^*}}_S\|^2_2}{\|\widehat{\mathbf{g^*}}_S\|^2_2} % \label{eq:function_error}
$
% which is the normalized energy of the difference over frequencies in $S$. In the case of $S=\{0,1\}^n$, Function Error is the normalized energy of the difference in the whole Fourier spectrum.
% \vspace{-4mm}
\end{metric}
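
As a hypothetical example in the setting of Section~\ref{sec:method}, take $S=\text{supp}(g^*)=\{f_1,\dots,f_5\}$ with $\widehat{g}^*(f_i)=1$, and suppose a network has learned the three lowest-degree coefficients exactly but neither of the two highest-degree ones, i.e., $\widehat{\mathbf{g_N}}_S = (1,1,1,0,0)$. Then
\[
    \text{SAE} = \frac{0^2+0^2+0^2+1^2+1^2}{5 \cdot 1^2} = 0.4.
\]
If the same network additionally places amplitude $0.5$ on four frequencies outside the support, then on the whole spectrum $S=\{0,1\}^{10}$ the numerator grows to $2 + 4 \cdot 0.5^2 = 3$ while the denominator remains $5$, so that $\text{SAE}=0.6$. These numbers are made up for illustration and are not experimental results.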
\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{plots/spectrum/function_error_n10_d5_size2.pdf}
  % \vspace*{-8mm}
  \caption{\looseness -1 Evolution of the spectral approximation error during training. The left plot limits the error to the target support, while the right one considers the whole Fourier spectrum. For the standard neural network, the SAE is considerably worse on the full spectrum which shows the importance of eliminating the erroneous frequencies that are not in the support of the target function. We also see the graceful scaling of SAE of \textsc{HashWH} (ours) with the hashing matrix size.}
  % \vspace*{0mm}
  \label{fig:function_error}
\end{figure}

Figure~\ref{fig:function_error} shows the SAE of the trained network under different regularization methods over epochs, both when $S$ is the target support and when $S=\{0,1\}^n$ (the whole Fourier spectrum).
% The standard network achieves an acceptable SAE but is still worse than \textsc{HashWH} (ours) on the target support (in all cases). However, the SAE is considerably worse for the standard neural net when taken on the full spectrum. 
The standard network displays a significantly higher (worse) SAE on the whole Fourier spectrum than on the target support, while the Walsh-Hadamard regularizers exhibit consistent performance across both.
This shows the importance of forcing the neural network to have zero Fourier coefficients on the non-target frequencies. Moreover, we can see that \textsc{HashWH} (ours) leads to a reduction in SAE that can be smoothly controlled by the size of its hashing matrix.

To gain more insight, we split the frequencies into subsets $S$ consisting of frequencies with the same degree. In Figure~\ref{fig:degree_split_function_error}, we visualize the evolution of the SAE and of the Fourier energy of the network, defined as $\|\widehat{\mathbf{g_N}}_S\|_2^2$. First, the energy of the high-degree frequencies is essentially zero for the standard neural network when compared to the low-degree frequencies, which further substantiates the claim that standard neural network training does not learn any high-degree frequencies. Our \textsc{HashWH} regularization scheme helps the neural network learn higher-degree frequencies, as there is more energy in the high-degree components. Second, looking at the lower degrees 2 and 3, we can see that the standard neural network reduces the SAE up to some point but then starts overfitting. Looking at the energy plot, one can attribute the overfitting to picking up irrelevant degree-2 and degree-3 frequencies. The regularization scheme helps prevent the neural network from overfitting on the low-degree frequencies, and their SAE decreases roughly monotonically. We observe that \textsc{HashWH} (ours) with a large enough hashing matrix exhibits the best performance among tractable methods in terms of SAE on all degrees. Finally, we can see that \textsc{HashWH} distributes the energy to where it should be for this dataset: less in the low-degree and more in the high-degree frequencies.


\begin{figure*}[!htb]
  \centering
  \includegraphics[width=0.9\linewidth]{plots/spectrum/degree_split_function_error_n10_d5_size2.pdf}
    % \vspace*{-3mm}
  \caption{Evolution of the Spectral Approximation Error (SAE) and energy of the network during training, split by frequency degree. Firstly, in a standard neural network, the energy of high-degree frequencies is essentially zero compared to low-degree frequencies. Secondly, for low degrees (2 and 3) the energy continues to increase while the SAE exhibits overfitting behavior. This implies the neural network starts learning erroneous low-degree frequencies after some epochs. Our regularizer prevents overfitting in lower degrees and enforces higher energy on higher-degree frequencies. Regularized networks show lower energies for lower degrees and higher energy for higher degrees when compared to the standard neural network.
}
    % \vspace*{0mm}
  \label{fig:degree_split_function_error}
\end{figure*}


% To summarise, we observe that, unlike the common belief that Neural Networks are learning high-degree noises when overfitting, they tend to find a local optimum by learning simpler low-degree noises. On the contrary, the practice of inducing sparsity in the Fourier spectrum demonstrates significant potential for saving the model from this phenomenon and letting it learn relevant high-degree frequencies.
Finally, it is worth noting that our regularizer makes the neural network behave more like a \emph{decision tree}. It is well known that ensembles of decision tree models have a sparse and low-degree Fourier transform \citep{kushilevitz1991learning}. Namely, let $g:\{0,1\}^n \rightarrow \R$ be a function that can be represented as an ensemble of $T$ trees, each of depth at most $d$. Then $g$ is $k=O(T \cdot 4^d)$-sparse and of degree at most $d$ (Appendix~\appref{app:subsec:tree_fourier}). Importantly, their spectrum is \emph{exactly sparse}: unlike standard neural networks, which seem to ``fill up'' the spectrum on the low-degree end, i.e., learn irrelevant low-degree coefficients, decision trees do not. Decision trees are well known to be effective on discrete/tabular data \citep{tabnet}, and our regularizer prunes the spectrum of the neural network so that it behaves similarly.
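
The sparsity bound can be seen from a short counting argument, sketched here under the assumption that each internal node of a tree splits on a single binary coordinate and that the Fourier basis functions are the parities $(-1)^{\langle f, x\rangle}$. A tree of depth at most $d$ has at most $2^d$ leaves, and the indicator $I_\ell(x)$ of reaching a given leaf $\ell$ is a product over the $d' \leq d$ coordinates $x_{i_1},\dots,x_{i_{d'}}$ tested along its path:
\[
    I_\ell(x) = \prod_{j=1}^{d'} \frac{1 + s_j (-1)^{x_{i_j}}}{2}, \qquad s_j \in \{-1,+1\}.
\]
Expanding the product yields at most $2^{d}$ parity functions, each of degree at most $d$. Summing over leaves and trees gives at most $T \cdot 2^d \cdot 2^d = T \cdot 4^d$ nonzero Fourier coefficients.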

% \input{experiments}
% \vspace{-5mm}
\section{Experiments}\label{sec:experiments}
% \vspace{-3mm}
In this section, we first evaluate our regularization method on higher-dimensional input spaces (higher $n$) using synthetically generated datasets. In this setting, \textsc{FullWH} is not applicable due to its exponential runtime in $n$. In addition, we vary the training set size to showcase the efficacy of the regularizer in improving generalization across dataset sizes, especially in the low-data regime. Next, we move on to four real-world datasets, where we show that our proposed regularizer \textsc{HashWH} achieves better generalization errors, again especially in the low-data regime. Finally, using an ablation study, we show experimentally that the low-degree bias does not result in lower generalization error.

% In all of the experiments, we used three 

% \subsection{Synthetic data} 
% \begin{figure*}
%     \centering
%      \begin{subfigure}[b]{0.3\linewidth}
%          \centering
%          \includegraphics[width=\linewidth]{plots/synthetic_large/performance_n25_seed3.pdf}
%          \caption{n=25}
%          \label{fig:synthetic_large_25}
%      \end{subfigure}
%      \begin{subfigure}[b]{0.3\linewidth}
%          \centering
%          \includegraphics[width=\linewidth]{plots/synthetic_large/performance_n50_seed3.pdf}
%          \caption{n=50}
%          \label{fig:synthetic_large_50}
%      \end{subfigure}
%           \begin{subfigure}[b]{0.3\linewidth}
%          \centering
%          \includegraphics[width=\linewidth]{plots/synthetic_large/performance_n100_seed3.pdf}
%          \caption{n=100}
%          \label{fig:synthetic_large_100}
%      \end{subfigure}
%     \caption{Generalization perfromance of learning a synthetic function $g^*:\{0,1\}^n \rightarrow \mathbb{R}$ with $c \cdot 25n$ samples. $R^2$s are averaged over $25$ runs and error bars represent the one standard error deviation. The x-axis is the sample size multiplier $c$. The y-axis is the $R^2$ computed on a hold-out test set not used in the training procedure. We provide significant improvements across all training sizes over EN-S and standard neural networks}
%     \label{fig:synthetic_large}
% \end{figure*}

% \begin{figure}
%     \centering
%      \includegraphics[width=\linewidth]{plots/synthetic_large/triple_runtime_n50_size5_seed3.pdf}
%     \caption{Best achievable generalization performance $R^2$ up to that point in time where ``time'' is (a) epoch (b)training time (seconds). (c) Shows the best achievable $R^2$ before overfitting plotted against the time it took to reach that point. The x-axis is logarithmically scaled for the training time. We are providing an order of magnitude speed-up compared to EN-S}\label{fig:runtime_synthetic_large}
% \end{figure}


\begin{figure*}[!htb]
    \centering
     \begin{subfigure}[b]{0.54\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/synthetic_large/column_performance_seed3.pdf}
         \caption{Generalization comparison}
         \label{fig:synthetic_large}
     \end{subfigure}
     \begin{subfigure}[b]{0.455\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/synthetic_large/triple_runtime_n50_size5_seed3.pdf}
        \caption{Runtime comparison}
         \label{fig:runtime_synthetic_large}
     \end{subfigure}
    \caption{(a) Generalization performance on learning a synthetic function $g^*:\{0,1\}^n \rightarrow \mathbb{R}$ with training set size $c \cdot 25n$. (b) Best achievable test $R^2$ (\RomanNumeralCaps 1) at the end of each epoch and (\RomanNumeralCaps 2) up to a certain time (seconds); (\RomanNumeralCaps 3) shows the early-stopped $R^2$ score vs.\ time (seconds). We provide significant improvements across all training sizes over \textsc{EN-S} and standard neural networks, while also showing an order of magnitude speed-up compared to \textsc{EN-S}.}
\end{figure*}

\begin{figure*}[!htb]
    \centering
     \begin{subfigure}[b]{\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/merged_real_data_score.pdf}
         % \vspace*{-7mm}
         \caption{Performance on learning real datasets}
         % \vspace*{3mm}
         \hspace*{\fill}
         \label{fig:real_data_score}
     \end{subfigure}
     \begin{subfigure}[b]{0.33\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/GB1_reduced_runtime.pdf}
         \caption{Runtimes for GB1}
         \label{fig:gb1_runtime}
     \end{subfigure}
         \begin{subfigure}[b]{0.41\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/ablation/Entacmaea_SGEMM.pdf}
         \caption{Ablation studies}
         \label{fig:ablation_dual}
     \end{subfigure}
    \begin{subfigure}[b]{0.25\linewidth}
         \centering
         \includegraphics[width=\linewidth]{plots/Entacmaea_energy.pdf}
         \caption{Energy distribution over \\degrees in Entacmaea}
         \label{fig:entacmaea_energy}
     \end{subfigure}
     % \vspace*{-4mm}
    \caption{(a) Generalization performance of standard and regularized neural networks and benchmark ML models on four real datasets. (b) Training times of different models on the GB1 dataset. (c) Results of an ablation study on the potential effect of simplicity bias on the generalization error. Picking higher-amplitude coefficients results in better generalization than picking the lower-degree terms. (d) Distribution of the energy over degree-based sets of frequencies in Entacmaea's top 100 Fourier coefficients. High-degree components constitute a non-negligible portion of the energy of the function.
}
\label{fig:score_real_data}
\end{figure*}



% \vspace{-5mm}
\subsection{Synthetic data} 
% \vspace{-2mm}
\textbf{Setup.}
Again, we consider a synthetic pseudo-boolean target function $g^*:\{0,1\}^n \rightarrow \mathbb{R}$ with $25$ frequencies in its support, $|\text{supp}(g^*)|=25$, each of degree at most five, i.e., $\forall f \in \text{supp}(g^*): \text{deg}(f)\leq5$. To draw a $g^*$, we sample each of its support frequencies $f_i$ by first sampling its degree uniformly, $d \sim \mathcal{U}_{\{1,2,3,4,5\}}$, then sampling $f_i$ uniformly from $\{f\in\{0,1\}^n \mid \text{deg}(f)=d \}$, and finally sampling its amplitude uniformly, $\widehat{g^*}(f_i) \sim \mathcal{U}_{[-1, 1]}$.

We draw $g^*$ as above for different input dimensions $n\in\{25,50,100\}$. We pick points uniformly at random from the input domain $\{0,1\}^n$ and evaluate $g^*$ on them to generate datasets of various sizes: for each multiplier $c\in \{1,\dots,8\}$, we generate five independently sampled datasets of size $c \cdot 25n$ (40 datasets for each $g^*$).
We train a 5-layer fully-connected neural network on each dataset using five different random seeds to account for the randomness in the training procedure. Therefore, for each $g^*$ and dataset size, we train and average over 25 models to capture the variance arising from both the dataset generation and the training procedure.
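
As a quick check of the scale implied by this setup, the dataset sizes $c \cdot 25n$ for $c \in \{1,\dots,8\}$ work out to
\begin{equation*}
\begin{array}{lll}
n=25: & 25n = 625, & \text{sizes from } 625 \text{ to } 5000,\\
n=50: & 25n = 1250, & \text{sizes from } 1250 \text{ to } 10000,\\
n=100: & 25n = 2500, & \text{sizes from } 2500 \text{ to } 20000.
\end{array}
\end{equation*}
Even in the smallest setting, $n=25$, the largest dataset contains at most $5000$ of the $2^{25} \approx 3.4 \times 10^{7}$ possible inputs, i.e., less than $0.02\%$ of the domain.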
%More results are in Appendix~\appref{app:subsec:synthetic_detailed}. 

\textbf{Results.} 
Figure~\ref{fig:synthetic_large} shows the generalization performance of different methods in terms of their $R^2$ score on a hold-out dataset (details of dataset splits in Appendix~\appref{app:sec:technical_details}) for different dataset sizes. Our regularization method, \textsc{HashWH}, outperforms the standard network and \textsc{EN-S} for every combination of input dimension and dataset size. Here, \textsc{EN-S} does not show any significant improvement over the standard neural network, while \textsc{HashWH} (ours) improves generalization by a large margin. Moreover, its performance is tunable via the hashing matrix size $b$.
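
For reference, we assume here that the $R^2$ score is the standard coefficient of determination. For hold-out targets $y_1,\dots,y_m$ with mean $\bar{y}$ and model predictions $\hat{y}_1,\dots,\hat{y}_m$ (notation introduced only for this display), it reads
\begin{equation*}
R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2} .
\end{equation*}
A score of $1$ corresponds to perfect prediction, a score of $0$ matches a constant predictor that always outputs $\bar{y}$, and negative scores are worse than that constant predictor.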

To stress the computational scalability of \textsc{HashWH} (ours), Figure~\ref{fig:runtime_synthetic_large} shows the achievable $R^2$ score as a function of the number of training epochs and of the training time for different methods, when $n=50$ and $c=5$ (see Appendix~\appref{app:subsec:synthetic_detailed} for other settings). The trade-off between training time and generalization can be directly controlled through the choice of the hashing size $b$. More importantly, \textsc{HashWH} reaches any given $R^2$ with runtimes that are orders of magnitude smaller than those of \textsc{EN-S}. This is primarily due to the very time-consuming approximation of the Fourier transform of the network at each epoch in \textsc{EN-S}.

% \vspace{-3mm}
\subsection{Real data}
\label{sec:exp:real_data}
% \vspace{-2mm}

% (e) Distribution of energy in top 100 Fourier coefficients of Entacmaea dataset over the split of Frequencies based on degree.}
    
Next, we assess the performance of our regularization method on four real-world datasets of varying nature and dimensionality. As baselines, we include not only standard neural networks and \textsc{EN-S} regularization, but also other popular machine learning methods that work well on discrete data, such as ensembles of trees. Three of our datasets are related to protein landscapes \citep{poelwijk_learning_2019,  sarkisyan_local_2016, wu_adaptation_2016} and are identical to the ones used by the proposers of \textsc{EN-S} \citep{aghazadeh_epistatic_2021}. The fourth is a GPU-tuning dataset \citep{nugteren_cltune_2015}. See Appendix~\appref{app:sec:datasets} for dataset details.
% We first describe the datasets used and then discuss the results.


\textbf{Results.}
Figure~\ref{fig:real_data_score} displays the generalization performance of different models on the four datasets mentioned above, using small training sets. For each dataset size, we randomly subsample the original dataset with five different random seeds to account for the randomness of the sub-sampling. For each subsample, we then fit five models with different random seeds to account for the randomness of the training procedure. We plot averages and one-standard-deviation error bars over the resulting 25 runs. Our regularization method significantly outperforms the standard neural network as well as popular baseline methods on nearly all datasets and dataset sizes. In some cases, however, the margin is somewhat smaller than in the synthetic experiments. This may be partially explained by the distribution of energy in a real dataset (Figure~\ref{fig:entacmaea_energy}), compared to the uniform distribution of energy over different degrees in our synthetic setting.

To highlight the importance of higher-degree frequencies, we compute the exact Fourier spectrum of the Entacmaea dataset (which is possible, since all possible input combinations are evaluated in the dataset).
Figure~\ref{fig:entacmaea_energy} shows the energy of the 100 frequencies with the highest amplitude (out of 8192 total frequencies), grouped by degree. The energy of the frequencies of degrees 3 and 4 is comparable to that of the degree-1 frequencies. However, as we showed in the previous section, the standard neural network may not be able to pick up the higher-degree frequencies due to its simplicity bias (while also learning erroneous low-degree frequencies).
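
The plotted quantity can be made explicit. Assuming, as is standard, that energy refers to squared Fourier amplitude, let $g$ denote the Entacmaea landscape and $T$ the set of the $100$ frequencies with the largest $|\widehat{g}(f)|$ (the symbols $g$, $T$ and $E_d$ are introduced here purely for illustration). The energy assigned to degree $d$ is then
\begin{equation*}
E_d = \sum_{\substack{f \in T \\ \text{deg}(f)=d}} \widehat{g}(f)^2, \qquad \sum_{d} E_d = \sum_{f\in T} \widehat{g}(f)^2 .
\end{equation*}
The total of $8192 = 2^{13}$ frequencies corresponds to an input dimension of $n=13$.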

We also study the relationship between the low-degree spectral bias and generalization in Figure~\ref{fig:ablation_dual}. The study is conducted on the two datasets ``Entacmaea'' and ``SGEMM''. We first fit a sparse Fourier function to our training data (see Appendix~\appref{app:sec:ablation_details}). We then delete coefficients one at a time in two separate settings: in the first, according to their degree (highest to lowest, with ties broken randomly), and in the second, according to their amplitude (lowest to highest). To assess generalization, we evaluate the $R^2$ of the resulting function on a hold-out (test) dataset. This study shows that among functions of equal complexity (in terms of support size), functions that keep the higher-amplitude frequencies generalize better than functions that keep the low-degree ones. This might seem evident from Parseval's identity, which states that the energy of a function in the time domain equals its energy in the Fourier domain. However, since the dataset distribution is not necessarily uniform, there is no guarantee that this holds in practice. Furthermore, the study shows the importance of our regularization scheme: deviating from low-degree functions and instead aiding the neural network to learn higher-amplitude coefficients \emph{regardless} of their degree.
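
The Parseval argument can be spelled out as follows. This is a sketch under the assumptions of an orthonormal Walsh-Hadamard basis and an error measured over all inputs in $\{0,1\}^n$ with equal weight; the symbols $g$, $S$ and $g_S$ are introduced here for illustration only. For a function $g$ with Fourier coefficients $\widehat{g}(f)$, let $g_S$ denote the function that keeps only the coefficients in a set $S \subseteq \text{supp}(g)$. Then
\begin{equation*}
\sum_{x\in\{0,1\}^n} \big(g(x) - g_S(x)\big)^2 = \sum_{f \in \text{supp}(g)\setminus S} \widehat{g}(f)^2 ,
\end{equation*}
so for a fixed support size $|S|$ the error is minimized by keeping the largest-amplitude coefficients. The equality relies on the orthogonality of the basis functions under the uniform distribution. Under a non-uniform data distribution, the cross terms between different frequencies no longer vanish, which is why the outcome has to be checked empirically.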

\textbf{Conclusion.} We showed through extensive experiments that neural networks tend not to learn high-degree frequencies and to overfit in the low-degree part of the spectrum. We proposed a computationally efficient regularizer that helps the network avoid overfitting in the low-degree frequencies while also picking up the high-degree frequencies. Finally, we demonstrated significant improvements in $R^2$ score on four real-world datasets compared to various popular models in the low-data regime.

% This is most probably due to the assumption of uniformity of the data distribution in our method; a typical assumption in the Compressed Sensing literature. Despite the uniform sampling done in our synthetic setting, using real data distribution can decrease the quality of the Fourier transform computed hence the gain from the regularization.




\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
This research was supported in part by the NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation.  We would also like to thank Lars Lorch and Viacheslav Borovitskiy for their detailed and valuable feedback in writing the paper.

\end{acknowledgements}

% \clearpage

\bibliography{gorji_368}
% \clearpage
% \newpage
% \setlength{\belowcaptionskip}{1mm}
% \appendix
% \input{appendix}
\end{document}
