\documentclass[pmlr]{jmlr}% new name PMLR (Proceedings of Machine Learning Research)

 % The following packages will be automatically loaded:
 % amsmath, amssymb, natbib, graphicx, url, algorithm2e

 %\usepackage{rotating}% for sideways figures and tables
\usepackage{longtable}% for long tables

 % The booktabs package is used by this sample document
 % (it provides \toprule, \midrule and \bottomrule).
 % Remove the next line if you don't require it.
\usepackage{booktabs}
 % The siunitx package is used by this sample document
 % to align numbers in a column by their decimal point.
 % Remove the next line if you don't require it.
\usepackage[load-configurations=version-1]{siunitx} % newer version
 %\usepackage{siunitx}
 \usepackage{amssymb}
 \usepackage{array}
 \usepackage{tabularray}
\UseTblrLibrary{booktabs} 

 % The following command is just for this sample document:
\newcommand{\cs}[1]{\texttt{\char`\\#1}}

 % Define an unnumbered theorem just for this sample document:
\theorembodyfont{\upshape}
\theoremheaderfont{\scshape}
\theorempostheader{:}
\theoremsep{\newline}
\newtheorem*{note}{Note}

 % change the arguments, as appropriate, in the following:
\jmlrvolume{1}
\jmlryear{2023}
\jmlrworkshop{NeurIPS 2023 Gaze Meets ML Workshop}

\title[StatTexNet: Evaluating Peripheral Statistics]{StatTexNet: 
Evaluating the Importance of Statistical Parameters for Pyramid-Based Texture and Peripheral Vision Models}

 % Use \Name{Author Name} to specify the name.

 % Spaces are used to separate forenames from the surname so that
 % the surnames can be picked up for the page header and copyright footer.
 
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % *** Make sure there's no spurious space before \nametag ***

 % Two authors with the same address
%\author{Authors Anonymous For Review}
\author{\Name{C. Koevesdi} \Email{koevesdc@mit.edu} \\
\Name{V. DuTell} \Email{vasha@mit.edu}\\
\Name{A. Harrington} \Email{annekh@mit.edu}\\
\Name{M. Hamilton} \Email{markth@mit.edu}\\
\Name{W. T. Freeman} \Email{billf@mit.edu}\\
\Name{R. Rosenholtz} \Email{rruth@mit.edu}\\
\addr MIT CSAIL, Brain and Cognitive Sciences}
%}


 % Three or more authors with the same address:
 % \author{\Name{Author Name1} \Email{an1@sample.com}\\
 %  \Name{Author Name2} \Email{an2@sample.com}\\
 %  \Name{Author Name3} \Email{an3@sample.com}\\
 %  \Name{Author Name4} \Email{an4@sample.com}\\
 %  \Name{Author Name5} \Email{an5@sample.com}\\
 %  \Name{Author Name6} \Email{an6@sample.com}\\
 %  \Name{Author Name7} \Email{an7@sample.com}\\
 %  \Name{Author Name8} \Email{an8@sample.com}\\
 %  \Name{Author Name9} \Email{an9@sample.com}\\
 %  \Name{Author Name10} \Email{an10@sample.com}\\
 %  \Name{Author Name11} \Email{an11@sample.com}\\
 %  \Name{Author Name12} \Email{an12@sample.com}\\
 %  \Name{Author Name13} \Email{an13@sample.com}\\
 %  \Name{Author Name14} \Email{an14@sample.com}\\
 %  \addr Address}


 % Authors with different addresses:
 % \author{\Name{Author Name1} \Email{abc@sample.com}\\
 % \addr Address 1
 % \AND
 % \Name{Author Name2} \Email{xyz@sample.com}\\
 % \addr Address 2
 %}

% \editor{Editor's name}
 % \editors{List of editors' names}

\begin{document}

\maketitle

\begin{abstract}
Peripheral vision plays an important role in human vision, directing where and when to make saccades. Although human behavior in the periphery is well predicted by pyramid-based texture models, these approaches rely on hand-picked image statistics that are still insufficient to capture a wide variety of textures. To develop a more principled approach to statistic selection for texture-based models of peripheral vision, we develop a self-supervised machine learning model to determine which set of statistics is most important for representing texture. Our model, which we call StatTexNet, uses contrastive learning to take a large set of statistics and compress them to a smaller set that best represents texture families. We validate our method using depleted texture images where the constituent statistics are already known. We then use StatTexNet to determine the most and least important statistics for natural (non-depleted) texture images using weight interpretability metrics, finding these to be consistent with previous psychophysical studies. Finally, we demonstrate that textures are most effectively synthesized with the statistics identified as important; we see noticeable deterioration when excluding the most important statistics, but minimal effects when excluding the least important. Overall, we develop a machine learning method of selecting statistics that can be used to create better peripheral vision models. With these better models, we can more effectively understand the effects of peripheral vision in human gaze.
\end{abstract}
\begin{keywords}
peripheral vision, texture synthesis, multi-scale pyramid, statistic selection, contrastive learning
\end{keywords}

\section{Introduction}
\label{sec:intro}


%Leveraging gaze in Machine Learning models is a growing area of interest both for modeling and predicting human behavior [cite]\citep{Yang_2020_CVPR}, as well as in improving efficiency [cite] and performance [cite] of image and video models. This has been driven by not only success of biologically-inspired models, but also by strong human behavioral and neuroscience evidence showing peripheral vision as a driver of both gaze behavior and perception \citep{rosenholtz2016capabilities}.

%However, most deep learning models that incorporate gaze over-simply peripheral vision as loss of resolution at the photoreceptor layer of the retina \citep{min2022peripheral}, often by simply downsampling or Gaussian blurring \citep{pramod2022human,tiezzi2022foveated}. This ignores the majority of information loss of peripheral vision, which occurs downstream in the brain [cite]. Ignoring the complex transformation of peripheral encoding in the brain limits the utility of gaze models ...

A key source of information in human gaze comes from peripheral vision. While it is often thought of as an adaptation to capacity limits of the human visual system, peripheral vision also drives human performance on many visual tasks -- including search, scene perception, and object detection \citep{ehinger2016general}. With respect to gaze specifically, peripheral vision plays a role in saccadic planning by helping humans determine where to look next \citep{schutz2011eye}.

Given its importance in understanding human gaze patterns, numerous attempts have been made to model peripheral vision. Multi-scale-pyramid-based models are the current state of the art. These models account for both the loss of photoreceptor density and the summarization of information thought to occur in brain areas V2 and V3. Models such as these treat peripheral vision as a texture-like representation and have a long history in human vision. They have been used not only in peripheral vision, but also in texture models more generally \citep{portilla2000parametric}. To simulate peripheral vision, these models utilize overlapping pooling regions that encircle the fovea and increase in size with eccentricity. While some models utilize machine learning techniques like style transfer \citep{wallis2017towards,deza2017towards} to summarize information, the majority compute summary statistics for each pooling region from the output of multi-scale pyramids.

One challenge of pyramid-based models of peripheral vision is determining which statistics are calculated in each pooling region. Although most pyramid-based texture models used to study peripheral vision have been validated through human behavioral studies, they still utilize statistic sets that are historically driven, vary from study to study, and are consistently insufficient to capture the wide variety of possible textures \citep{brown2021efficient}.

The problem of selecting which statistics are necessary and sufficient to represent the variety of textures perceived in peripheral vision is critical for the goal of building better models of human gaze. Because testing every single texture by hand or with a human-in-the-loop is not feasible, we leverage self-supervised approaches in machine learning to address the problem of statistic selection in peripheral vision models. In this work, we develop a contrastive learning model, StatTexNet, to explore the relative importance of pyramid-based statistics for representing peripheral vision. To validate our machine learning approach to statistic selection, we test our framework on a set of depleted texture images with known statistics. We demonstrate that StatTexNet selects the known most important statistics in these depleted textures. We then apply our model to full texture images and use weight interpretability metrics to determine which statistics are most important for representing texture families. Finally, we synthesize textures using statistics selected by our method.

By building a machine-learning-driven approach to statistic selection, our work automates the evaluation of statistics used by texture-based peripheral vision models. With a better method of understanding and evaluating peripheral vision models, we can build a more complete understanding of human gaze.

\section{Previous Work}
\label{sec:prevwork}

Peripheral vision represents the majority of the visual field, and both critically limits and enables human performance at a variety of tasks \citep{rosenholtz2016capabilities}. This includes gaze behavior, where information from both the fovea and the periphery is integrated to inform saccades \citep{stewart2020review}.

Some of the best-performing models of peripheral vision use a multi-scale pyramid approach. Most pyramid-based peripheral vision models are based on work from the texture modeling world. Early work in this area included that of \citet{julesz1962visual}, who first explored different textures that could be represented by the same N-th order pixel statistics. Large improvements were seen with a move from pixel-based to multi-scale-pyramid-based statistical representations \citep{simoncelli1995steerable}. The steerable pyramid, which breaks down an input image into distinct spatial frequency and orientation bands, has since been widely used in vision modeling because its filters resemble those found in the mammalian early visual system \citep{turner1986texture, malik1990preattentive}. Using the steerable pyramid, \citet{heeger1995pyramid} proposed a statistics set calculated on this pyramid decomposition, alongside a histogram-matching procedure that enabled good texture synthesis. This was refined further by \citet{portilla2000parametric}, who included pixel, autocorrelation, and magnitude statistics.

When these texture models were first applied to peripheral vision \citep{rosenholtz2012summary, freeman2011metamers}, they utilized a statistics set similar to that of \citet{portilla2000parametric}. Statistics were modified from this set by being hand-chosen and tested for necessity and sufficiency through trial and error on a limited test set of textures. More recent work has modified these statistics slightly, tested them on a wider variety of conditions and textures, and made code more flexible and efficient \citep{brown2021efficient,wallis2017towards}.

Behavioral evidence supports the statistics set used by these state-of-the-art peripheral vision models. These models are often used to create mongrels, also known as metamers, which are visual stimuli that match another stimulus in representational space, but can differ significantly in pixel space. When viewed foveally, the pixel differences are obvious, but when viewed peripherally, the stimuli are indistinguishable. Mongrels have been shown through careful psychophysical experimentation to reproduce the same capabilities and limitations of human peripheral vision, including crowding \citep{balas2009summary} and scene perception \citep{ehinger2016general}. In addition, the scaling parameters for pooling regions needed to create metamers/mongrels mirror those of neuron receptive fields in non-human primates \citep{freeman2011metamers}.

Despite the success of these models, it is clear that the current state-of-the-art statistic set is insufficient. A faithful model of human peripheral vision should work regardless of input type. However, investigations into the effect of different texture families have revealed that for current models, textures with certain properties are more faithfully represented, while metamers/mongrels of other texture types consistently fail \citep{brown2021efficient, broderick2023foveated}. These problems occur despite modifications to optimization strategy, hyperparameters, and seed.

Some efforts have worked to eliminate the need to choose specific statistics altogether. Mongrels have been successfully created by taking inspiration from style transfer \citep{gatys2016image}, utilizing the entire Gram matrix as the statistical representation to create metameric images \citep{deza2017towards,wallis2017towards}. While this removes the need for hand-picking statistics, the resulting matrix is huge and likely over-parameterized, which removes any potential compression advantage. Another example is the work of \citet{serre2007robust}, which simply takes the maximum output of each pooled area. Although the field has made significant progress toward improving the statistic component of peripheral vision models, it is clear that a more principled approach to selecting the statistics is needed.

\section{Modeling Textures Through Statistics}
\label{sec:model}

\begin{figure}[h!]
  \centering
  \includegraphics[width=1.0\textwidth]{Images/Image_model.png}
  \caption{Our model compresses the representation of a texture model.}
  \label{fig:image_model}
\end{figure}

In order to build a better method of selecting the most important statistics for texture-based peripheral vision models, we devise a contrastive learning model, StatTexNet, to take a large set of statistics and compress it to a smaller set. In our model, we take 5-crops (4 corners and center) from a texture image dataset, and calculate their summary statistic representation using the GPU-optimized code from \citet{brown2021efficient} (Figure \ref{fig:image_model}). This consists of convolving each $128 \times 128$ pixel crop with a steerable pyramid filter bank and calculating 150 summary statistics from the resulting pyramid images. We then use a single fully connected layer to compress this statistical representation, which we train through contrastive learning. The input dimensionality is thus 150. For the output latent space, we choose a dimensionality of 50, as it provides the most effective clustering in our experiments. While we use the statistics set from the \citet{brown2021efficient} model as a baseline, we note that it is similar to the statistics sets of other popular models \citep{portilla2000parametric,freeman2011metamers,rosenholtz2012summary}, with some statistics removed for computational savings, simplicity, and empirical reasons, and with an additional statistic set, `end-stopped', included.
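
As a minimal sketch of the compression step, and assuming the fully connected layer includes a bias term (the description above does not say whether it does), the mapping from a crop's statistic vector $\mathbf{s} \in \mathbb{R}^{150}$ to its latent code $\mathbf{z} \in \mathbb{R}^{50}$ can be written as
\begin{equation*}
\mathbf{z} = W\mathbf{s} + \mathbf{b}, \qquad W \in \mathbb{R}^{50 \times 150}, \quad \mathbf{b} \in \mathbb{R}^{50},
\end{equation*}
where the symbols $\mathbf{s}$, $\mathbf{z}$, $W$, and $\mathbf{b}$ are notation introduced here for illustration. Under this assumption the layer has $50 \times 150 + 50 = 7550$ trainable parameters, and column $j$ of $W$ collects the $50$ weights attached to input statistic $j$.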

\section{Summary Statistics Sets}
\label{sec:statsics}
StatTexNet starts with an initial set of summary statistics, which are split into two groups: \textit{first-order} and more complex \textit{second-order and higher} statistics.
\\

Following \citep{brown2021efficient}, we utilize the following statistics: 
\begin{itemize}
\item[]\textbf{First-order statistics:}
\item From the raw input image pixels, the first four moments (mean, variance, skewness, and kurtosis) of the grayscale histogram.
\item The variance of both the high- and low-pass bands, with skewness and kurtosis also computed for the latter.
\item For the non-oriented lowpass bands, the variance, skewness, and kurtosis.
\item For each bandpass filter output, the magnitude-mean and variance.
\item[]\textbf{Second- or higher-order statistics:}
\item Magnitude-correlations between bandpass filters. This includes correlations between all orientations at the same scale in the steerable pyramid, as well as correlations between neighboring scales at the same orientation.
\item The same correlations are also computed for the phase images. 
\item Finally, unique to \citet{brown2021efficient} is the \textit{End-Stopped} statistic. This statistic is based on end-stopped neurons or hypercomplex cells in visual cortex \citep{hubel1959receptive}, and differentiates between segmented and continuous lines. Specifically, each edge magnitude component image is subtracted from a slightly shifted version of itself, following the expected edge direction. The resulting difference is then squared.
\end{itemize}
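
For concreteness, the four pixel moments listed above can be written in their textbook form. For a crop with $N$ pixel intensities $x_1, \dots, x_N$ (notation introduced here for illustration),
\begin{align*}
\mu &= \frac{1}{N} \sum_{i=1}^{N} x_i, &
\sigma^2 &= \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2, \\
\gamma &= \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^{3}, &
\kappa &= \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^{4},
\end{align*}
where $\gamma$ is the skewness and $\kappa$ the kurtosis. These are the standard population definitions; the exact normalization used in the \citet{brown2021efficient} code (for example, biased versus unbiased variance, or excess kurtosis $\kappa - 3$) is not stated here and may differ.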

%\section{Using Depleted Textures With Known Statistics}

\begin{figure}[h!]
  \centering
  \includegraphics[width=1.0\textwidth]{Images/HB_texture_diagram.png}
  \caption{We generate depleted textures from a known set of statistics, feed these controlled images to our model, and perform the same procedure as in Figure \ref{fig:image_model}.}
  \label{fig:HB_model}
\end{figure}


The set of statistics needed to represent natural images is essentially unbounded. Synthesized images, however, are created using only a known set of statistics. In order to control for the set of statistics present in a given texture image and validate our method of statistic selection, we create synthesized versions of each texture using the Heeger \& Bergen texture model \citep{heeger1995pyramid}. The Heeger and Bergen procedure performs histogram matching on first-order statistics \textit{only}, and thus its syntheses are constrained only by this subset of statistics. These synthesized textures are \textit{depleted} in that they do not contain the full set of statistics needed to fully describe the original textures. They can therefore be used to validate our method, as a model should not need higher-order statistics to group them. This enables us to test whether a network can learn the relative importance of different groups of statistics from different datasets. To do this, we follow the same pipeline as in Figure \ref{fig:image_model} with these depleted images (Figure \ref{fig:HB_model}).

\section{Datasets}

\begin{figure}[h!]
  \centering
  \includegraphics[width=1.0\textwidth]{Images/dataset.png}
  \caption{Dataset visualization through sample textures. The top row indicates the original texture and the bottom row shows the synthesized texture through the Heeger and Bergen procedure.}
  \label{fig:dataset}
\end{figure}

In this study, we utilized two primary datasets: the Describable Textures Dataset (DTD) \citep{cimpoi2014describing} and the KTH-TIPS2-b (KTH) dataset \citep{mallikarjuna2006kth}, which we use for validation. The DTD captures a wide array of textures found in natural settings and is a collection of 5,640 images spanning 47 distinct texture categories. These images were primarily sourced from platforms like Flickr and Google Search. The KTH dataset contains 4,752 images representing 11 different materials that were acquired by imaging 4 different samples of each material, each under varying pose, illumination, and scale.
Due to the way it was collected, DTD has more intra-class variation than KTH.

We transformed all RGB images from these datasets into grayscale. We then applied the Heeger and Bergen texture synthesis procedure \citep{heeger1995pyramid} to these grayscale images. The Heeger and Bergen approach iteratively modifies a Gaussian white noise image so that the pixel distributions in its steerable pyramid representation match those of the reference texture. This is done through histogram matching, which adjusts an image's grayscale pixel value distribution so that it aligns with the histogram of a reference image. Histogram matching thus ensures identical first-order statistics, but does not guarantee similar spatial structures or correlations between images.
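
Histogram matching can be expressed compactly through cumulative distribution functions. As a sketch in notation introduced here for illustration, let $F_{\mathrm{in}}$ and $F_{\mathrm{ref}}$ denote the empirical cumulative distribution functions of the image being modified and of the reference image. Each value $x$ of the modified image is then mapped to
\begin{equation*}
x' = F_{\mathrm{ref}}^{-1}\bigl( F_{\mathrm{in}}(x) \bigr),
\end{equation*}
where $F_{\mathrm{ref}}^{-1}$ is the quantile function (generalized inverse) of the reference distribution. After this mapping the values of the modified image follow the reference distribution, so its marginal statistics match those of the reference. The mapping acts on each value independently of its position, which is why the spatial arrangement of values, and hence any correlation structure, is left unconstrained.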

Consequently, we have four datasets at our disposal to test our hypotheses: two are the original grayscale sets (DTD and KTH), and the other two are depleted sets derived by applying the Heeger and Bergen synthesis method to DTD and KTH. Figure \ref{fig:dataset} shows some examples from these datasets.

\section{Training}

\subsection{Contrastive Learning}

Our goal is to reduce the full set of $150$ \citet{brown2021efficient} image statistics to a compressed representation 1/3 the size, forcing the network to prioritize information from certain statistics over others. To do this, we employ contrastive learning \citep{chen2020simple}, allowing our network StatTexNet to learn any representation that is useful in discriminating textures. Contrastive learning works by ensuring that similar pairs, such as crops from the same image, are drawn close together in representation space, while distinct pairs are pushed apart based on a specified distance measure. For this task, we utilize generalized lifted structured loss \citep{hermans2017defense} with a Euclidean distance. The advantage of this loss is its ability to effectively process the entire training batch, taking into consideration both closely related pairs (positive pairs) and those that are unrelated (negative pairs). In one training step, all pairs are considered (see Appendix Section \ref{apd:CL}).
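
As a sketch of what this loss looks like, one form of the generalized lifted structured loss given by \citet{hermans2017defense} is reproduced below in notation introduced here for illustration. The exact variant and margin used for StatTexNet are not specified in this section, so the details should be read as assumptions. For a batch containing $P$ texture images with $K$ crops each (here $K = 5$), an embedding function $f$, Euclidean distance $D$, and margin $m$,
\begin{equation*}
\mathcal{L} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Bigg[ \log \sum_{\substack{p=1 \\ p \neq a}}^{K} e^{D\left(f(x_a^i),\, f(x_p^i)\right)} + \log \sum_{\substack{j=1 \\ j \neq i}}^{P} \sum_{n=1}^{K} e^{m - D\left(f(x_a^i),\, f(x_n^j)\right)} \Bigg]_{+},
\end{equation*}
where $x_a^i$ denotes crop $a$ of texture $i$ and $[\,\cdot\,]_{+} = \max(\cdot, 0)$. The first term grows with the distance between an anchor and the other crops of the same texture, and the second grows as crops of other textures move closer to the anchor, so every positive and negative pair in the batch contributes to a single training step.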

For our input data, we take a single texture image from one of our datasets and crop it into $5$ smaller images. This gives us a set of $5$ images that we know come from the same texture, and thus should be represented by very similar statistical values. Crops from the same image are treated as positive samples and crops from different texture images are treated as negative samples in our framework. We train our contrastive learning networks for $200$ epochs.
%Using the texture databases, we take 5-crops from each image, we trained our network for 100 epochs. 
(See Appendix Section \ref{apd:apx_aug} for details on data augmentation.) To process our data efficiently and ensure consistent gradient updates, we selected a batch size of $100$. After evaluating different optimization techniques, we settled on the Adam optimizer due to its adaptive learning rate and proven success in similar tasks, and used a learning rate of $0.0001$.

\subsection{Dropout}

One complication of our model is that correlations between different elements of the 150-statistic set could potentially cause the network to ignore certain highly correlated or anti-correlated statistics. We reasoned that, because our statistics sets represent uniform spatial samples of natural images that have inherent regularities \citep{ruderman1997origins,simoncelli2001natural}, correlation between statistics was highly likely. When multiple statistics correlate sufficiently, the network can learn to rely on only one of them and discount the others.

To address this, we first checked for correlations among statistics (Appendix Section \ref{apd:correlations}), and found that indeed, the majority of statistics measured show high correlation with at least one other. We counteract this issue by incorporating dropout during training, so that some features are temporarily set to zero at random during each forward pass. This prevents the model from becoming too reliant on specific features, as it forces the model to distribute weight more evenly across correlated groups, and thus mitigates the effects of multi-collinearity. We find that incorporating dropout greatly improves the results in Table \ref{tab:rankings}, compared to training without dropout.
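
The effect of dropout on a single input can be sketched as follows, assuming that dropout is applied directly to the $150$ input statistics and uses the common inverted-scaling convention; neither the placement nor the dropout probability $p$ is specified in this section, and the notation is introduced here for illustration. For each statistic $s_j$ and each forward pass,
\begin{equation*}
\tilde{s}_j = \frac{m_j}{1 - p}\, s_j, \qquad m_j \sim \mathrm{Bernoulli}(1 - p),
\end{equation*}
so that $\mathbb{E}[\tilde{s}_j] = s_j$. If two statistics $s_j$ and $s_k$ are strongly correlated, a network relying only on $s_j$ loses that information whenever $m_j = 0$, which happens with probability $p$, whereas a network that spreads weight over both loses it only when $m_j = m_k = 0$, which happens with probability $p^2$ under independent masks.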

\begin{figure}[h!]
  \centering
  \includegraphics[width=1.0\textwidth]{Images/tsne.png}
  \caption{t-SNE visualizations of learned embeddings of the DTD dataset for original texture images (left) and depleted texture images (right). The network learns to cluster textures well in both cases, and the right plot indicates that depleted images are clustered better.}
  \label{fig:tsne}
\end{figure}

\subsection{t-SNE}

In addition to seeing a reduction in loss over training, we validated the effectiveness of our contrastive learning approach using t-SNE \citep{van2008visualizing} to visualize the learned latent representation space. To do this, we ran inference on a set of 20 randomly chosen textures, each with 5 crops, and visualized the 2D embedding of the 50-dimensional space (Figure \ref{fig:tsne}). We find that crops from the same texture indeed cluster together well. This is especially true for the depleted textures synthesized with the \citet{heeger1995pyramid} procedure.

\section{Rankings for Depleted Data}

\begin{table}[h]
\centering
\caption{Importance metrics for 50 first-order summary statistics averaged over 10 seeds based on two different feature selection methods. For both orderings, all three measures, and both datasets, first-order statistics are ranked higher (are more important) for depleted textures created with these statistics only, than for original textures.}
\begin{booktabs}{|l|l|cc|cc|}
\toprule
& & \SetCell[c=2]{c} \textbf{DTD} & & \SetCell[c=2]{c} \textbf{KTH} \\
\midrule
Ordering & Metric & Original & Depleted & Original & Depleted \\
\midrule
 & \textbf{\% in Top 15} & 36.00 & 75.33 \checkmark & 80.07 & 100.00 \checkmark \\
\textbf{Weight} & \textbf{Median rank} & 59.45 & 41.60 \checkmark & 36.00 & 34.10 \checkmark \\
& \textbf{Mean rank} & 63.36 & 52.65 \checkmark & 46.99 & 44.71 \checkmark \\
\midrule 
 & \textbf{\% in Top 15} & 32.67 & 49.33 \checkmark & 43.33 & 70.67 \checkmark \\
\textbf{Shapley} & \textbf{Median rank} & 71.20 & 49.45 \checkmark& 64.30 & 44.85 \checkmark\\
& \textbf{Mean rank} & 67.75 & 55.55 \checkmark & 62.89 & 50.36 \checkmark \\
\bottomrule
\end{booktabs}
\label{tab:rankings}
\end{table}
% \multicolumn{2}{c}{\textbf{DTD}}
\subsection{Weight-Based Ordering}

After training, we perform weight-based ordering on StatTexNet to determine the significance of each statistic. We sum the absolute weight values for each input node in our model, where each input node corresponds to one of the 150 feature statistics. Because we use scaling in the weight matrix to normalize, a statistic more useful in classification should be weighted more highly by the network. We then sort these summed weights in descending order, ranking the statistics from the most important (low rank) to the least important (high rank).
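
In symbols, and using notation introduced here for illustration, let $W \in \mathbb{R}^{50 \times 150}$ be the trained weight matrix of the fully connected layer, with $W_{kj}$ the weight from input statistic $j$ to latent unit $k$. The weight-based importance score and rank of statistic $j$ described above are then
\begin{equation*}
I_j = \sum_{k=1}^{50} \lvert W_{kj} \rvert, \qquad r_j = 1 + \bigl\lvert \{\, j' : I_{j'} > I_j \,\} \bigr\rvert, \qquad j = 1, \dots, 150,
\end{equation*}
so that the statistic with the largest summed absolute weight receives rank $r_j = 1$ and the one with the smallest receives rank $150$ (ignoring ties).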

% For instance, raw statistics determine an image's brightness. Therefore, it is expected to see a substantial percentage of these statistics among the top 15 most important statistics. So, the observed $11$-$13\%$ difference between original and depleted datasets stands out as a meaningful observation.

% (Figure \ref{tab:rankings} - Top .

We used three metrics to evaluate whether StatTexNet can learn the most important statistics across the different datasets. We hypothesized that the first-order statistics matched by Heeger and Bergen synthesis would play a more important role for the depleted data than for the original textures. To test this, we calculated the weights for each dataset and then ranked the 150 statistics by their importance. As a first metric, we observed how many of the first-order statistics rank in the top 15 of overall most important statistics. Here, we expect that for the synthesized texture datasets, there will be a higher percentage of very important first-order statistics compared to the original datasets. (We note that raw statistics determine basic image properties, such as brightness; it is therefore expected that even in the original textures, a substantial percentage of the first-order statistics should be among the top 15, though an increase in their prevalence should be expected for the depleted textures.) Furthermore, we assessed both the mean and median importance rank of the 50 first-order statistics in the overall importance ranking. We took the mean value of these three metrics over 10 different seeds and observed consistent results across all of them.
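
The three metrics can be stated explicitly. In notation introduced here for illustration, let $r_j \in \{1, \dots, 150\}$ be the importance rank of statistic $j$ (rank $1$ being most important) and let $\mathcal{F}$ be the index set of the $50$ first-order statistics. Then, for a single seed,
\begin{align*}
\text{\% in Top 15} &= 100 \cdot \frac{\bigl\lvert \{\, j \in \mathcal{F} : r_j \le 15 \,\} \bigr\rvert}{15}, \\
\text{Mean rank} &= \frac{1}{50} \sum_{j \in \mathcal{F}} r_j, \\
\text{Median rank} &= \operatorname{median} \{\, r_j : j \in \mathcal{F} \,\}.
\end{align*}
As a point of reference, if the ranking were unrelated to group membership, the expected share of first-order statistics in the top 15 would be $50/150 \approx 33.3\%$ and the expected mean rank would be $(1 + 150)/2 = 75.5$. Values above $33.3\%$ and below $75.5$, respectively, therefore indicate that first-order statistics are ranked as more important than this chance baseline.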

We find that for both datasets, and for all three measures (\% in Top 15, Median Rank, Mean Rank), relative rankings reflect our expected results. That is, when trained on the depleted dataset, StatTexNet consistently ranked the 50 first-order statistics higher (more important) than when trained on original textures (Table \ref{tab:rankings}, Top). For the KTH dataset, 100\% of the top 15 ranked statistics belonged to the first-order statistics for the depleted data (this occurred in all 10 separate trainings with different random seeds). This indicates that our framework is sound, and that weight-based ordering is able to identify the most and least important statistics for a contrastive learning task.

%The median and mean rank of the first-order statistics further support our primary findings. The median, in particular, presents a more pronounced difference as it is less influenced by outliers. The difference in means is less noticeable, given that certain statistics inherently rank lower in importance across both datasets. 

%For instance, raw statistics determine an image's brightness. Therefore, it is expected to see a substantial percentage of these statistics among the top 15 most important statistics. S

\subsection{Shapley Value-Based Ordering}

While weight-based ordering using the average absolute value of weights offers strong support for our hypothesis that depleted data would favor first-order statistics more heavily, we sought a more sophisticated mathematical approach to test and validate our findings. Shapley values \citep{roth1988shapley} are an interpretability tool grounded in game theory that assigns credit to individual inputs for a given output of a machine learning model. We utilized the SHAP package \citep{NIPS2017_7062} to calculate Shapley values for each of the 150 statistics, and used these values in place of the absolute value of weights to order statistics by importance.
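
For reference, the classical Shapley value of input $j$ is defined as follows, with $N = \{1, \dots, n\}$ the set of inputs (here $n = 150$) and $v(S)$ a value function giving the model output attributable to a subset $S$ of inputs:
\begin{equation*}
\phi_j(v) = \sum_{S \subseteq N \setminus \{j\}} \frac{\lvert S \rvert! \, (n - \lvert S \rvert - 1)!}{n!} \Bigl[ v(S \cup \{j\}) - v(S) \Bigr].
\end{equation*}
The values satisfy $\sum_{j \in N} \phi_j(v) = v(N) - v(\emptyset)$, so the credit assigned to the inputs adds up to the difference between the full output and the baseline. Because the sum ranges over $2^{n-1}$ subsets, exact evaluation is infeasible for $n = 150$, and packages such as SHAP rely on approximations. How $v$ is defined for StatTexNet, and how per-sample values are aggregated into a single ranking, is not specified in this section; a typical choice, stated here only as an assumption, is to average $\lvert \phi_j \rvert$ over samples and outputs.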

We find that the rankings based on Shapley values also support our hypothesis that depleted texture-trained networks will more heavily rely on the 50 first-order statistics than networks trained on their complete texture counterparts (Table \ref{tab:rankings}, Bottom). Given these results indicating the strong utility of ranking via Shapley values, we chose to utilize this ranking procedure alongside weight in exploring the statistical importance for non-depleted data.

% The overall group statistics average further solidify our hypothesis. Over the four barplots in figure \ref{fig:groupranking}, it is noticeable that the first-order statistics groups are overall more important in the synthesized dataset. Further, marginal statistics play an important role across all datasets while phase-correlations are less important on average. 

\section{Statistical Importance for Original Textures}

\begin{figure}[h!]
  \centering
  \includegraphics[width=1\textwidth]{Images/WeightShapleyScatter.png}
  \caption{Statistic rankings for two datasets tested. Small points indicate individual statistics, large points indicate group statistic means (circle). Phase-correlation statistics are consistently of low importance, while most other statistics families show heterogeneous performance. Shapley ranking of statistics shows better correlation between datasets tested.}
  \label{fig:stat_scatter}
\end{figure}

Having validated that our method works in identifying the most and least important statistics for texture representation, we turn to the results on original (non-depleted) textures. First, to understand the relative importance of each statistic, we computed the mean ranking of the nine statistics groups (Figure \ref{fig:stat_scatter}, bar plots in Figure \ref{fig:groupranking}), averaged over 10 seeds. 

We find that, overall, bandpass variance (a single statistic) ranks highly across both datasets and both ranking procedures (especially for DTD), indicating that it is important. Magnitude-mean statistics also cluster consistently towards high rankings. Most of the other statistics show a wide distribution of rankings. This is true across datasets, within datasets, and for both ranking systems. End-stop and magnitude-correlation statistics in particular show widely distributed rankings, appearing among both the most and the least important statistics.

We find that phase-correlation is consistently ranked far lower than all other statistic classes, with rankings clustered near the bottom, indicating that it is a less important statistic overall. This finding is consistent with previous psychophysical literature \citep{balas2006texture}, which found phase-correlation to be unimportant for discriminating textures. Interestingly, only our weight-ranked results are consistent with their finding that marginals are highly important for discrimination.

% In addition to  also performed a different training and feature selection method. Here we
% performed classification on the encoded textures and analyzed whether a multilayer network
% can correctly classify textures based on their statistics. As a statistics selection method, we
% performed feature permutation and observed the accuracy drop. While the results supported
% our hypothesis, there were strong correlations between some features observed, thus distorting our classification analysis. 



\section{Synthesis}

\begin{figure}[h!]
  \centering
  \includegraphics[width=1.0\textwidth]{Images/Browngenerations.png}
  \caption{These three textures represent synthesis failure and success classes based on \citet{brown2021efficient}: high contrast (first row, lined), middle contrast (middle row, porous), and low contrast (bottom row, painted). Low roughness/coarseness textures (bottom row) have poor syntheses even for the full statistics set. Magnitude-mean is important for high and middle contrast textures, as shown in the first two rows. Phase-correlation can be removed without much quality loss compared to the full statistics synthesis.}
  \label{fig:generation}
\end{figure}
One advantage of the texture/peripheral models studied here is the ability to synthesize textures from a given statistics set, which allows us to visually validate our results. We emphasize that synthesis results vary widely, being both highly seed-dependent and texture-dependent \citep{brown2021efficient,broderick2023foveated}; nonetheless, we include some syntheses here to demonstrate the effect of depleting various statistics.

We show examples of textures with properties found by \citet{brown2021efficient} to be most and least well-captured by the full statistics set. We find that high-contrast textures, like the lined texture, show similar performance to baseline (All) when the less important phase-correlation statistic is removed, but fail completely when the highly important magnitude-mean statistic is removed. Lower-contrast textures, like the painted image, show similarly poor synthesis in all cases. The porous texture, lying somewhere in between, has similar synthesis performance to baseline when phase-correlation is removed, and slightly worse performance when magnitude-mean is removed. This agrees with our observations in Figure \ref{fig:groupranking}, where the magnitude-mean statistics are notably more important than the phase-correlation statistics. Because the phase-correlation group comprises more statistics than the magnitude-mean group, this offers a meaningful point of comparison.

These syntheses support the results uncovered through our contrastive learning approach. While the 150 statistics of \citet{brown2021efficient} are not sufficient for all textures, removing the phase-correlation statistics often has little visible effect, while removing the magnitude-mean statistics is often noticeable and sometimes catastrophic.

\section{Discussion}

In this work, we combine self-supervised learning with weight interpretability analysis to develop, validate, and use a novel method that enables the principled selection and prioritization of the texture summary statistics underlying modern peripheral vision models. By adding a single fully-connected layer to a texture model, we create StatTexNet, which we train with contrastive learning to prioritize the most important statistics on the task of grouping textures from the same family together. We show that StatTexNet successfully learns to group textures, indicating that it learned an optimal statistical representation of texture.

In addition, we use multiple weight interpretability metrics to order the relative contribution of individual statistics. To validate this ordering, we create a depleted texture set synthesized with a reduced set of statistics, train our network on these textures, and confirm that this reduced set of first-order statistics is more important for grouping depleted textures than for grouping original ones. We show that this result is consistent for 6 different orderings/metrics across 2 different datasets, averaged over multiple seeds.

Finally, we use this method to identify the relative importance of statistics in representing natural textures. When averaging over the sometimes heterogeneous texture families, we find that bandpass variance and magnitude-mean are the most important overall, while phase-correlation is the least important. We show that our results are consistent not only with a small sample of synthesized textures, but also with previous psychophysical work \citep{balas2006texture} that evaluated discrimination abilities for depleted textures. While that work found marginal statistics to be among the most important for texture discrimination, like ours it found cross-scale phase statistics to be among the least important.

Overall, our method demonstrates a novel, efficient, and principled approach to selecting the statistics for peripheral vision models, as well as the pyramid-based texture models that underlie them. While a human in the loop will likely always be necessary to fully validate a statistics set, our method can make such experiments more directed, since testing all possible subsets of even 150 statistics in a formal eye-tracked psychophysics experiment is not feasible.

Future work could scale up our approach using the larger sets of statistics from existing models \citep{portilla2000parametric,freeman2011metamers,rosenholtz2012summary}, or a novel, much larger set of possible statistics. Additionally, the human visual system is thought to use highly complex transforms and performs a variety of tasks beyond grouping textures. Our method could be used to explore the effect of modeling more complex transformations on statistical importance, or the effect of alternative tasks such as classification, since more complex multi-layer weight structures are compatible with the Shapley method demonstrated here. Overall, with our principled and scalable approach to statistic selection, we can work toward better models of texture, peripheral vision, and human gaze as a whole.

\section{Acknowledgements}

This work was funded by the CSAIL MEnTorEd Opportunities in Research (METEOR) Fellowship, US National Science Foundation under grant number 1955219, as well as National Science Foundation Grant BCS-1826757 to PI Rosenholtz. The authors acknowledge the MIT SuperCloud \citep{reuther2018interactive} and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper.

%The bibliography is displayed using \verb|\bibliography|.

% \acks{We thank Ayush Tewari, Eric Li, and Ge Yang for helpful comments on this work. }

\newpage

\bibliography{pmlr-sample}

\newpage

% \appendix

\section{Appendix}

\subsection{Implementation}

The implementation of this project is available as a GitHub repository at \newline \url{https://github.com/RosenholtzLab/StatNetExperiments}.

\subsection{Contrastive Learning}
\label{apd:CL}

Our aim is to learn a function \(f_\theta(x) : \mathbb{R}^F \rightarrow \mathbb{R}^D\) that pulls encoded crops from the same class in \(\mathbb{R}^F\) closer together in \(\mathbb{R}^D\), while pushing crops originating from different textures further apart.

The function \(f_\theta\) is parameterized by \(\theta\), the collective set of weights and biases of the neural network that are learned during training to achieve the desired embeddings in the 50-dimensional space. For encoded textures $x$, the loss function is given by:
\begin{equation}
\text{L}(\theta; X) = \sum_{i=1}^{P} \sum_{a=1}^{K} \left[ \log \left( \sum_{\substack{p=1 \\ p \neq a}}^{K} e^{D(f_{\theta}(x^i_a), f_{\theta}(x^i_p))} \right) + \log \left( \sum_{\substack{j=1 \\ j \neq i}}^{P} \sum_{n=1}^{K} e^{m-D(f_{\theta}(x^i_a), f_{\theta}(x^j_n))} \right) \right]_+ \nonumber
\end{equation}
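Here, $P$ is the number of classes (textures) in a batch, $K$ is the number of crops per class, $m$ is the margin, and $[\cdot]_+ = \max(0, \cdot)$. The first term in the bracket sums over all positive pairs and the second over all negative pairs, so every pair in the batch is considered at once.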
% \text{L}(\theta; X) = &\sum_{i=1}^{P} \sum_{a=1}^{K} \left[ \log \left( \sum_{\substack{p=1 \\ p \neq a}}^{K} e^{D(f_{\theta}(x^i_a), f_{\theta}(x^i_p))} \right) \right. \notag \\
% &\left. + \log \left( \sum_{j=1}^{P} \mathop{}_{\substack{j \neq i}} \sum_{n=1}^{K} e^{m-D(f_{\theta}(x^i_a), f_{\theta}(x^j_n))} \right) \right]_+ 
As in \citet{hermans2017defense}, the distance measure used is the Euclidean distance:
\begin{equation}
     D(f_\theta(x_i), f_\theta(x_j)) = \lVert f_\theta(x_i) - f_\theta(x_j) \rVert_2 \nonumber
\end{equation}

\subsection{Data augmentation}
\label{apd:apx_aug}
For our self-supervised learning, we apply several transformations to the images. We use a random vertical flip with probability 0.5 and a random horizontal flip with the same probability. As the final step, we take five crops from the adjusted image: one from each corner and one from the center. These five crops together represent one class in the dataset and in the contrastive learning setup. We avoided most other transformations, such as blurring or jittering, because they could change the statistic values. After augmenting, we encode the five cropped images using the 150-statistic set. To keep the data consistent, we normalize the statistics with the scikit-learn \texttt{StandardScaler}. This helps ensure that our network is not influenced by the differing scales of the statistics. These normalized statistics are then processed through a single-layer network with input size $150$ and output size $50$.
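
A minimal formalization of the last two steps can be written as follows, under the assumptions that the scaler is fit separately for each statistic and that no nonlinearity follows the layer (the symbols $z$, $\mu_k$, $\sigma_k$, $W$, and $b$ are notation introduced here for illustration):
\begin{equation}
z_k = \frac{x_k - \mu_k}{\sigma_k}, \qquad f_\theta(z) = W z + b, \qquad W \in \mathbb{R}^{50 \times 150},\; b \in \mathbb{R}^{50} \nonumber
\end{equation}
where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of statistic $k$. Under this reading, the weight-based importance of statistic $k$ is the average absolute value of column $k$ of $W$, that is $\frac{1}{50}\sum_{d=1}^{50} |W_{dk}|$, and it is comparable across statistics only because the normalized inputs share a common scale.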

\subsection{Labeling of statistics}
\label{apd:labeling}
The labeling of statistics is systematic, driven by their statistic group and the filter of the steerable pyramid they are derived from. We follow three distinct patterns of labeling. 
\begin{itemize}
\item Non-correlation statistics: These are indicated in the format ``statistic level orientation''. For instance, ``end\_stop 1 1'' refers to the end-stop statistic for the first orientation at the first pyramid level.
\item Correlations between neighboring scales: These follow the format ``statistic (level\_1, level\_2) orientation'', e.g. ``magnitude\_correlation (2,3) 3'', signifying a correlation between the second and third levels for the third orientation.
\item Correlations within a level across different orientations: These are denoted as ``statistic level (orientation\_1, orientation\_2)''. This structure labels a correlation occurring within a specific pyramid level but across two orientations, e.g. ``magnitude\_correlation 1 (1,3)''.
\end{itemize}

\subsection{Correlations in Statistics}
\label{apd:correlations}

\begin{figure}[h!]
  \centering
  \includegraphics[width=1.0\textwidth]{Images/correlations.png}
  \caption{Correlation heatmap for all 150 statistics. Strong red indicates positive correlation ($1.0$), while dark blue indicates anti-correlation ($-1.0$). There are high correlations between many statistics, especially within groups. There is also a subset of statistics that are anti-correlated or uncorrelated.}
  \label{fig:statcorrs}
\end{figure}

We expected many of the statistics measured in our analysis to be correlated, due to the regularities present in natural images. To investigate the degree to which correlations between different statistics are present in our analysis, we calculated the correlation between each pair of statistics over the dataset, then used Spearman correlation to group the statistics.
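
For reference, the Spearman correlation between statistics $k$ and $l$ is the Pearson correlation of their ranks across samples. Writing $R(x_k)$ for the rank-transformed values of statistic $k$ over the dataset (notation introduced here for illustration):
\begin{equation}
\rho_{kl} = \frac{\operatorname{cov}\!\left(R(x_k), R(x_l)\right)}{\sigma_{R(x_k)}\,\sigma_{R(x_l)}} \nonumber
\end{equation}
Values near $1$ or $-1$ indicate a strongly monotonic relationship, which need not be linear.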

We find that, indeed, many statistics are highly correlated with each other. Marginals show strong correlation with other marginals, but little correlation with other statistics. The end-stop and magnitude statistics as a whole are strongly correlated with one another. In addition, there are strong repeated patterns of correlation and anti-correlation between phase and magnitude statistics. Statistics of the same type and scale/level share these patterns and cluster together.
\newpage

\subsection{Statistic Importance by Group}
\begin{figure}[h]
  \centering
  \includegraphics[width=1.0\textwidth]{Images/GroupRankBar.png}
  \caption{Mean ranking for the statistics groups with rankings based on weight (left), and Shapley values (right), for original textures of both datasets tested. Bandpass variance statistics generally rank with high importance, and phase-correlation statistics consistently rank with low importance.}
  \label{fig:groupranking}
\end{figure}

\newpage

\subsection{Most and Least Important Statistics}
\label{apx:most_least_list}

\begin{table}[h]
\centering
\caption{The 10 most important statistics for the DTD and KTH datasets, averaged over 10 seeds, based on Shapley-value ranking.}
\begin{booktabs}{|ll|ll|}
\toprule
\SetCell[c=2]{c} \textbf{DTD} & & \SetCell[c=2]{c} \textbf{KTH} \\
%\multicolumn{2}{c}{DTD} & \multicolumn{2}{c}{KTH} \\
\midrule
Stat & Avg Rank & Stat & Avg Rank \\
\midrule
end\_stop 1 1 & 7.90 & end\_stop 3 2 & 5.80 \\
end\_stop 1 3 & 9.90 & magnitude\_mean 3 3 & 13.70 \\
magnitude\_correlation 1 (0, 2) & 10.50 & end\_stop 3 0 & 14.30 \\
end\_stop 1 0 & 11.40 & magnitude\_mean 3 1 & 15.80 \\
end\_stop 1 2 & 11.90 & magnitude\_correlation (3, 4) 3 & 16.50 \\
magnitude\_correlation 1 (1, 3) & 12.80 & magnitude\_correlation 3 (0, 2) & 16.70 \\
magnitude\_mean 1 3 & 14.20 & end\_stop 2 2 & 17.80 \\
magnitude\_mean 1 1 & 14.50 & magnitude\_variance 3 3 & 19.00 \\
magnitude\_mean 1 2 & 19.70 & magnitude\_correlation (2, 3) 3 & 20.40 \\
magnitude\_correlation 1 (0, 3) & 19.90 & magnitude\_mean 4 3 & 20.50 \\
\bottomrule
\end{booktabs}
\end{table}


\begin{table}[h]
\centering
\caption{The 10 least important statistics for the DTD and KTH datasets, averaged over 10 seeds, based on Shapley-value ranking.}
\begin{booktabs}{|ll|ll|}
\toprule
\SetCell[c=2]{c} \textbf{DTD} & & \SetCell[c=2]{c} \textbf{KTH} \\
%\multicolumn{2}{c}{DTD} & \multicolumn{2}{c}{KTH} \\
\midrule
Stat & Avg Rank & Stat & Avg Rank \\
\midrule

phase\_correlation (2, 3) er*di 1 & 132.30 & phase\_correlation (2, 3) ei*di 1 & 135.90 \\
phase\_correlation (2, 3) ei*di 2 & 133.10 & phase\_correlation (1, 2) ei*di 1 & 136.10\\
phase\_correlation (2, 3) ei*di 0 & 133.50 & phase\_correlation (1, 2) ei*di 3 & 136.10\\
phase\_correlation (2, 3) ei*di 1 & 133.50 & phase\_correlation (3, 4) er*di 0 & 137.10\\
phase\_correlation 1 er (0, 2) & 134.30 & end\_stop 4 2 & 137.90\\
phase\_correlation (2, 3) ei*di 3 & 135.10 & phase\_correlation 1 er (0, 2) & 138.10\\
phase\_correlation 2 er (0, 2) & 136.20 & phase\_correlation 2 er (0, 2) & 139.40\\
end\_stop 4 2 & 136.40 & phase\_correlation (2, 3) ei*di 0 & 139.80\\
phase\_correlation (3, 4) ei*di 0 & 138.20 & phase\_correlation (2, 3) er*di 0 & 140.00\\
phase\_correlation (3, 4) er*di 0 & 139.90 & phase\_correlation (2, 3) er*di 2 & 140.60\\

\bottomrule
\end{booktabs}
\end{table}


\end{document}
