%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}
% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
\usepackage{graphicx}
%\usepackage{subfigure} % Do not include this package; if so, it will ruin subfigure formatting. ONLY use \usepackage{subcaption}
\usepackage{booktabs} % for professional tables
\usepackage{multirow}
\usepackage[table,xcdraw]{xcolor}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

% Additional packages to preamble the formulas, tables
\usepackage{macros}
\usepackage{enumitem}
%\usepackage{geometry}      % do not include this package--it will change the required icml 2023 format
\usepackage{url}            % simple URL typesetting
\usepackage{makecell}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{floatrow}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{amsmath}
\usepackage{bbm}
% \usepackage{multirow}
% \usepackage{graphicx}
% \usepackage[table,xcdraw]{xcolor}

\renewcommand{\a}{\alpha}
\renewcommand{\b}{\beta}
\newcommand{\xhdr}[1]{\vspace{0.1mm}\noindent{{\bf #1.}}}
\setlength{\abovedisplayskip}{-30pt}
\setlength{\belowdisplayskip}{-15pt}
\setlength{\abovedisplayshortskip}{-30pt}
\setlength{\belowdisplayshortskip}{-15pt}
\setlength{\textfloatsep}{10pt plus 0.0pt minus 0.0pt}
\setlength\abovecaptionskip{0pt}
\setlength\belowcaptionskip{0pt}
% Table float box with bottom caption, 

\title{Studying the Effect of GNN Spatial Convolutions On The Embedding Space's Geometry}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Claire Donnat}
\author[1]{So Won Jeong}
% Add affiliations after the authors
\affil[1]{%
    Department of Statistics\\
    The University of Chicago\\
    Chicago, Illinois, USA
}
%\affil[2]{%
%    Second Affiliation\\
%    Address\\
%    …
%}
  
\begin{document}
\maketitle

\begin{abstract}
By recursively summing node features over entire neighborhoods, spatial graph convolution operators have been heralded as key to the success of Graph Neural Networks (GNNs). Yet, despite the proliferation of GNN methods across tasks and applications, the effect of this aggregation operation has yet to be analyzed. In fact, while most recent efforts in the GNN community have focused on optimizing the architecture of the neural network, fewer works have attempted to characterize \textit{(a) the different classes of spatial convolution operators}, {\it (b) their impact on the geometry of the embedding space}, and {\it (c) how the choice of a particular convolution should relate to properties of the data}. In this paper, we propose to answer all three questions by dividing existing operators into two main classes ({\it symmetrized} vs. {\it row-normalized} spatial convolutions), and show how these correspond to different implicit biases on the data. Finally, we show that this convolution operator is in fact tunable, and make explicit the regimes in which certain choices of convolutions --- and therefore, embedding geometries --- might be more appropriate.
\end{abstract}
%\vspace{-0.3cm}

\section{Introduction and Motivation} \label{sec:intro}
% !TEX root = main.tex
 As the extension of the Deep Neural Network (DNN) machinery to the graph setting, Graph Neural Networks (GNNs) offer a powerful paradigm for extending Machine Learning tools to the analysis of relational data, typically modeled through a graph \citep{battaglia2018relational,dong2020graph, hamilton2017inductive, kipf2016semi, wu2020comprehensive,zhou2020graph}. The recent impressive success of GNNs across tasks and applications \citep{casas2019spatially,gaudelet2021utilizing,gilmer2017neural,li2021representation,ma2019genn,wu2020graph} in dealing with this complex data type has been attributed to their two main components: \textbf{(a) a convolution operator} $\mc{C}$ (or propagation operator \citep{zhou2020graph,zhou2020understanding}) that aggregates information contained in the neighborhood of any given node to create neighborhood-aware embeddings; and \textbf{(b) a non-linear layer} (or transformator \citep{zhou2020understanding}) that multiplies the convolved features by a weight matrix before applying a non-linearity. All GNN parameters are typically trained in an end-to-end fashion and --- as for Deep Neural Networks --- have been deemed essential in creating powerful {non-linear node embeddings} that can be tailored to any downstream prediction task.
Depending on the architecture of the network, such graph convolution blocks are then stacked and/or repeated to encode varying ``receptive depths'' \citep{frasca2020sign, kipf2016semi,wu2019simplifying}. More formally, denoting as $H^{(k)}$ the output of the $k^{\text{th}}$ layer, GNNs can be broadly understood as a sequence of node convolutions of the form $H^{(k)}_u = \sigma(\mc{C}_{\mc{N}(u)}( H_u^{(k-1)}) W^{(k)}  + b^{(k)})$,
where $H_u^{(0)} = X_u$ are node $u$'s raw features, $\sigma$ is a non-linearity (e.g., ReLU), and $\mc{C}_{\mc{N}(u)}$ is the convolution operator applied to the neighborhood $\mc{N}(u)$ of node $u$.
The final layer is always taken to be linear and written as:
\vspace{-0.15cm}
\begin{gather} \label{eq:last_layer}
    H_u^{(K)} = \mc{C}_{\mc{N}(u)}(H_u^{(K-1)})  W^{(K)}  + b^{(K)}.
\end{gather}

\xhdr{The convolution operator} Most existing theoretical analyses of GNNs differentiate two main classes of convolution operators $\mc{C}$ \citep{zhou2020graph}: {\it (i) spectral operators} \citep{defferrard2016convolutional, dong2020graph,gama2018convolutional,gama2018diffusion,henaff2015deep}, that build on the eigenvectors of some version of the Graph Laplacian (defined in its unnormalized version as $L= D-A$, with $A$ the adjacency matrix and $D$ the degree matrix); and {\it (ii) spatial operators}, that recursively aggregate node features within a given neighborhood. While in practice the dichotomy between the two is attenuated by their implementation (spectral methods usually resort to using low-order Chebyshev polynomials of the Laplacian \citep{defferrard2016convolutional,shuman2011chebyshev,shuman2013emerging}, thus effectively ``localizing'' the signal), these two approaches have different interpretations and implications for downstream data analysis. In this paper, we propose to focus on spatial convolutions, as popularized by the framework of Kipf et al. \citep{kipf2016semi}. In this case, the convolution operator usually amounts to summing node features over entire neighborhoods $\mc{N}(u)$, so that the convolution becomes the function: $\mc{C}: \{X_v\}_{v\in \mc{N}(u)} \to \mc{C}(\{X_v\}_{v\in \mc{N}(u)})=SX$, where $S$ is a weight matrix. \cite{kipf2016semi} suggest taking $S$ to be the normalized adjacency matrix with added self-loops: $S =  \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$, where $\hat{A}=A+I$ is the adjacency matrix of the graph with self-loops, $\hat{D}$ its diagonal degree matrix, and $X\in \mathbb{R}^{n \times p}$ is the feature matrix for the $n$ nodes. The summation is crucial in preserving permutation invariance over the neighborhood and in encoding long-range dependencies by repeatedly stacking GNN layers --- thereby allowing information to percolate through the graph.
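
As a brief illustration (our own unpacking of the formula above, not an additional result), consider an unweighted graph in which $\mc{N}(u)$ does not contain $u$ itself, and write $\hat{d}_u = \hat{D}_{uu}$ for the degree of node $u$ in the graph with self-loops. The row of $SX$ associated with node $u$ then reads
\begin{equation*}
(SX)_u = \sum_{v \in \mc{N}(u) \cup \{u\}} \frac{1}{\sqrt{\hat{d}_u \hat{d}_v}} \, X_v,
\end{equation*}
so that each neighbor $v$ contributes its features with a weight that decreases with both $\hat{d}_u$ and $\hat{d}_v$.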


\xhdr{Variations on the convolution operator} While GNN methods have grown increasingly popular, the {form of the spatial convolution operator} seems to have received less attention from the community. Table~1 in Appendix~A provides a summary of the main graph convolution operators that are currently available in the PyTorch Geometric package \citep{fey2019fast} --- here taken as a proxy for the most popular convolution choices. As observed from this table, all of the spatial operators rely on a (weighted) sum of neighborhood features, in line with the framework of Kipf et al. \citep{kipf2016semi}. In fact, while GraphSage \citep{hamilton2017inductive} has been tested with alternative permutation-invariant aggregators (such as ``max'' pooling and an LSTM version of the convolution operator), the authors did not report any significant advantage of these methods over the simple sum. Similarly, \citet{xu2018powerful} show that a simple summation is often sufficient to characterize a multiset. {\it Thus, without loss of (practical) generality, we restrict our study to sum-based aggregators}. In this setting, two additional variations have appeared in recent years: {\it (i) attention-based spatial convolutions} \citep{brody2021attentive,velivckovic2017graph, xie2020mgat}, which attempt to learn tailored edge weights; and more broadly, {\it (ii) weighted spatial convolutions} (e.g., using graphon weights \citep{parada2021graphon,ruiz2020graphon} or kernels \citep{nikolentzos2020random,feng2022kergnns}, among others \citep{zhang2020wagnn}), that refine the adjacency matrix by endowing it with edge intensities. While efforts have thus focused on defining ``the best edge weights'', little attention has been paid to formally analyzing the convolution itself.
Yet, attention-based convolutions tend to be {\it row-normalized}, while previous methods (e.g., GCN \citep{kipf2016semi}, GIN \citep{xu2018powerful}) usually suggest weighted {\it symmetrized} adjacencies. Our purpose here is to show that this choice is not trivial, and translates into fundamental differences in the geometry of the embedding space.

\xhdr{Prior work: studying the effect of the convolution operator} The literature focusing on understanding the mechanisms behind the success of Graph Neural Networks has considerably expanded over the past few years. Most notably, an important body of work has focused on understanding the role of this convolution operator in phenomena such as oversmoothing \citep{oono2019graph,cai2020note,chen2020measuring} and oversquashing \citep{alon2020bottleneck,topping2021understanding}. Most of these analyses are conducted from a spectral perspective, relating the behavior of the embeddings to the convolution operator's eigenvalues. However, to the best of our knowledge, none of these approaches has specifically focused on understanding the effect of the convolution operator on the organization of the underlying embedding space, nor attempted to tie these properties to any {\it topological features}. While methods have tried to embed nodes in specific manifolds with desirable geometries (e.g., hyperbolic \citep{liu2019hyperbolic,law2021ultrahyperbolic}), no work seems to have studied the geometry induced by the convolution itself. We posit that such considerations are nonetheless fundamental in our understanding of GNNs and their stability.


\xhdr{Contributions} The objective of this paper is to answer three questions: \textit{(a) how does the choice of a particular convolution operator affect the organization of the embedding space}, \textit{(b) how does it relate to the original properties (i.e., node features, graph distances or topological attributes)}, and {\it (c) what is the most appropriate convolution operator for a given dataset?} We attempt to answer all three questions by studying two larger families of row-normalized and symmetrized convolution operators (parametrized by the variables $\alpha \in [0,1]$ and $\beta \in \R^+$), allowing us to show how the convolution operator itself is in fact tunable. In particular, we will show how different values of $\alpha$ and $\beta$ impact the organization of the latent space (Section~\ref{sec:geometry}) and the inherent geometry of the embeddings (Section~\ref{sec:geometry-intrinsic}). Finally, we will characterize regimes in which certain choices of operators might be more relevant than others.

\section{Defining a family of convolution operators} \label{sec:convolutions}
To analyze the impact of the convolution operator on the embedding geometry, we define two main families of spatial convolutions:
\begin{description}[topsep=-0.5em, itemsep=0em, leftmargin=3mm]
\item[a. Symmetrized Convolution Operators,] defined as the family of operators of the form: 
    \begin{align}\label{eq:normalized}
     \mathcal{F} =\big\{ &M_{\alpha, \beta} = D_{\beta}^{-\alpha} (A + \beta I)D_{\beta}^{-\alpha} \big | \alpha \in [0,1], \nonumber\\ & \beta \in \R^+,  D_{\beta}  =\text{diag}\big((A + \beta I) \mathbbm{1}\big)  \big\}.
    \end{align}

Here, $ D_{\beta}$ is the degree matrix associated with the $\beta$-augmented adjacency matrix $A+\beta I$.
Note that this family is a generalization of the traditional GCN convolution $S_{GCN} = \hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2}$, which corresponds here to a choice of $\alpha=0.5$ and $\beta=1$. The choice $\alpha=0.5$ and $\beta=2$ also features among the default implementations in PyTorch Geometric \citep{fey2019fast}. Similarly, the convolution operator indexed by $\alpha=0, \beta=1$ corresponds to a version of the GIN convolution \citep{xu2018powerful}, and more generally, to the sum-based message-passing versions of GNNs \citep{battaglia2018relational}.
\item[b. Row-normalized Convolution Operators,] which we define as the general family of the form: 
    \begin{align}\label{eq:regularized}
    \mathcal{M} =\big\{ D_{(\alpha, \beta)}^{-1} M_{\alpha, \beta} \big| & D_{(\alpha, \beta)} = \text{diag}\big(M_{\alpha, \beta} \mathbbm{1}\big), \nonumber \\ &M_{\alpha, \beta} \in \mathcal{F}\big\}. 
    \end{align}
This family encompasses a number of operators, such as the sum-based convolution deployed in GraphSage \citep{hamilton2017inductive} --- and, to some extent, that of GAT \citep{velivckovic2017graph}, which considers row-normalized convolutions of a ``learned'', modified version of the adjacency matrix. 
\end{description}
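To fix ideas, the following toy computation (an illustrative example of ours, assuming the unweighted path graph on three nodes with edges $\{1,2\}$ and $\{2,3\}$, and $\beta = 1$) instantiates one member of each family. Here $A + I$ has degree matrix $D_{1} = \text{diag}(2,3,2)$, so that the symmetrized operator with $\alpha = 1/2$ (the GCN convolution) and the row-normalized operator built from $M_{0,1} = A + I$ are, respectively,
\begin{align*}
M_{1/2,1} &= \begin{pmatrix} \frac{1}{2} & \frac{1}{\sqrt{6}} & 0 \\ \frac{1}{\sqrt{6}} & \frac{1}{3} & \frac{1}{\sqrt{6}} \\ 0 & \frac{1}{\sqrt{6}} & \frac{1}{2} \end{pmatrix}, \\
D_{(0,1)}^{-1} M_{0,1} &= \begin{pmatrix} \frac{1}{2} & \frac{1}{2} & 0 \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ 0 & \frac{1}{2} & \frac{1}{2} \end{pmatrix}.
\end{align*}
The rows of the second matrix sum to one, so each convolved feature vector is a convex combination (here, a plain average) of neighborhood features. The rows of the first do not (the middle row sums to $\frac{1}{3} + \frac{2}{\sqrt{6}} \approx 1.15$), so the scale of the aggregated features varies with the node degrees.
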
For both families of convolution operators, the parameter $\alpha$ impacts the weights assigned to nodes as a function of their degree: as $\alpha$ increases, high-degree nodes are increasingly penalized, so that their contribution to neighboring node embeddings decreases (relative to lower-degree nodes). On the other hand, $\beta$ can be interpreted as capturing the amount of ``innovation'' or relevant information that the source node brings to the embedding with respect to its neighborhood. In particular, for high values of $\beta$, the source node's contribution to its own embedding outweighs that of the neighborhood, so that the convolved features become essentially identical to those of the source node. Consequently, $\beta$ allows us to interpolate between the traditional GNN regime ($\beta=1$) and the MLP setting ($\beta \to \infty$). Table~1 in Appendix~A lists a number of popular GNN convolutions, along with their associated families and parameters --- thereby highlighting the ubiquity of this framework.
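
The interpolation toward the MLP regime can be checked directly in the simplest case (a sketch of ours, assuming $\alpha = 0$ and an unweighted graph without self-loops in $A$, with $D$ the degree matrix of $A$ and $d_u = D_{uu}$). The row-normalized operator is then $P_\beta = (D + \beta I)^{-1}(A + \beta I)$, where the symbol $P_\beta$ is introduced for this example only, with entries
\begin{equation*}
[P_\beta]_{uv} = \begin{cases} \dfrac{\beta}{d_u + \beta} & \text{if } v = u, \\[2mm] \dfrac{A_{uv}}{d_u + \beta} & \text{if } v \neq u. \end{cases}
\end{equation*}
As $\beta \to \infty$, the diagonal entries tend to $1$ and the off-diagonal entries to $0$, so that $P_\beta$ converges to the identity and each layer reduces to $\sigma(H^{(k-1)} W^{(k)} + b^{(k)})$, that is, a standard MLP layer applied to each node separately.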

\xhdr{Empirical consequences of a choice of operator} While seemingly superficial, the distinction between these two families corresponds in fact to different assumptions on the nature of the data. Consider the (potentially weighted) adjacency matrix with self-loops as a similarity matrix. GNN row-normalized convolutions bear a striking resemblance to the diffusion maps suggested by \cite{coifman2006diffusion} for embedding nodes in the ``featureless'' case where $X=I$. Diffusion maps are embeddings provided by the eigenvectors of the matrix:
$ M=  D_{(\alpha, \beta)}^{-1} M_{\alpha, \beta}, \quad M_{\alpha, \beta} \in \mathcal{F}$  for an appropriate choice of $\beta$. More generally, \cite{coifman2006diffusion} show that the eigenvectors of the matrix $M^t, t\in \R^+$ allow one to recover the structure of the manifold underlying the graph at larger and larger scales. The stacking of the different GNN convolutions without non-linearities (e.g., \cite{wu2019simplifying}) can be compared to a variation of the diffusion process proposed by \cite{coifman2006diffusion}: in the case of GNNs, $t\in \mathbb{N}$ is discrete and corresponds to the depth of the network. This connection is instructive for the choice of $\alpha$.
\cite{coifman2006diffusion} indeed emphasized the importance of the choice of $\alpha$: if the data density is assumed to be uniform, then a choice of $\alpha=0$ ensures that the operator is an unbiased estimate of the Laplace-Beltrami operator. Conversely, $\alpha=1$ is better suited for manifold estimation with non-uniform sampling densities. While GNNs consider embeddings of both the graph and its node features, we posit that $\alpha$ might have a similar effect on the estimation procedure. This comparison implicitly assumes that the data lives on some smooth Riemannian manifold, where the degree or centrality of a node corresponds to an assumption on the sampling distribution over this manifold: nodes with high degree correspond to ``well-sampled'' areas of the manifold. Therefore, in this setting, it is intuitively possible to get an accurate representation of the local information by averaging neighborhood features: the higher the degree, the higher the amount of certainty around the node's value. By contrast, symmetrized embeddings are weighted sums --- but not convex combinations --- of neighbors. Here, the sum simply plays the role of a permutation-invariant aggregation operator \citep{hamilton2020graph}, and as we will see, is able to encode topological features (e.g., structural roles \citep{donnat2018learning}). 
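
To make the analogy with diffusion explicit (a sketch of ours, assuming all non-linearities and biases are removed, as in \cite{wu2019simplifying}, and a fixed convolution matrix $S$ shared across layers), a $K$-layer network collapses to
\begin{equation*}
H^{(K)} = S^K X W, \qquad W = W^{(1)} W^{(2)} \cdots W^{(K)},
\end{equation*}
so that the depth $K$ plays the role of the diffusion time $t$ in $M^t$. When $S$ is row-normalized with non-negative entries, it is a transition matrix, and the row of $S^K X$ associated with node $u$ is the expected feature vector at the endpoint of a $K$-step random walk with transition matrix $S$ started at $u$.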
%In particular, as we will in the next subsection, this embedding allows to embed topological features.

\xhdr{Experiments} Finally, to motivate our study before diving into more theoretical considerations, we highlight the impact of this choice on standard benchmarks in the literature (we refer the reader to Appendix~E for an overview of the properties of these datasets)\footnote{The code for the experiments can be found \href{https://github.com/sowonjeong/gnn-geometry-uai}{here}.}. Figure~\ref{fig:introb} highlights the impact of the values of $\alpha$ and $\beta$ on the classification accuracy for Cora. Note here that we are not suggesting the use of any particular tuple $(\alpha,\beta)$ for Cora, but simply highlighting its influence on the performance of the algorithm. In particular, for the symmetric operator, the value of $\alpha$ is the main driver of the difference in performance. By contrast, the choice of $\beta$ has less effect on the performance of the GNN --- unless $\beta$ becomes too large and outweighs the rest of the neighbors. This effect might be due to the significant level of homophily in Cora: in this setting, the source node's feature vector is fairly redundant with that of its neighbors. However, we show in Appendix~E additional examples where the impact of $\beta$ is much more substantial. Most strikingly, the choice of $\alpha$ seems to have a significant effect on the performance of the symmetrized GNN, with a phase transition at $\alpha=0.5$: for values of $\alpha$ greater than this threshold, the performance drops quite substantially. Notably, in the ``poor'' performance region, the interaction between $\alpha$ and $\beta$ is more marked: choosing a low value of $\beta$ (i.e., $\beta=0$) seems to mitigate the decrease in performance. 
\begin{figure*}
     \centering
          \begin{subfigure}[t]{0.37\textwidth}
              \centering
    \includegraphics[width=1.0\textwidth, height=3.8cm]{Experiment_Figures_Fin/Accuracy/cora_accuracy.png}
    \caption{Accuracy as a function of $\alpha$ ($\beta=1$).}
     \end{subfigure}
\begin{subfigure}[t]{0.62\textwidth}
   \centering
    \includegraphics[width=\textwidth]{FIGS/plot_alpha_beta_accuracy_35experiments (1).png}
    \caption{Cora dataset: test accuracy as a function of $\alpha$ and $\beta$}\label{fig:introb}     
    \end{subfigure}
    \caption{Results for Cora for our family of convolutions defined in Eq.~\ref{eq:normalized} and Eq.~\ref{eq:regularized} (50 independent experiments, selecting a random training set and test set). Note the strong dependency of the results on both $\alpha$ and $\beta$ for the symmetrized convolution (Eq.~\ref{eq:normalized}).}
    \label{fig:intro}
\end{figure*}


\begin{table*}[ht]
\centering
\resizebox{\textwidth}{!}{%
\begin{tabular}{c|c|c|c|c|c|c|c|c|c|c|c|}
\cline{2-12}
\rowcolor[HTML]{C0C0C0} 
\cellcolor[HTML]{FFFFFF} &
  \textbf{Alpha} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} &
  \cellcolor[HTML]{C0C0C0} \\ \cline{1-2}
\rowcolor[HTML]{C0C0C0} 
\multicolumn{1}{|c|}{\cellcolor[HTML]{C0C0C0}\textbf{Dataset}} &
  \textbf{Convolution Type} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.1}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.2}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.3}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.4}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.5}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.6}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.7}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.8}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{0.9}} &
  \multirow{-2}{*}{\cellcolor[HTML]{C0C0C0}\textbf{1.0}} \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  77.02±2.13 &
  77.85±1.7 &
  78.91±1.78 &
  \cellcolor[HTML]{9B9B9B}\textbf{79.37±1.78} &
  79.23±2 &
  77.63±2.53 &
  72.71±3.43 &
  60.67±4.09 &
  41.59±5.71 &
  31.19±2.59 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{Cora}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{9B9B9B}\textbf{78.12±2.27} &
  \cellcolor[HTML]{EFEFEF}78.08±2.01 &
  \cellcolor[HTML]{EFEFEF}77.97±2.16 &
  \cellcolor[HTML]{EFEFEF}77.83±2.1 &
  \cellcolor[HTML]{EFEFEF}77.81±2.07 &
  \cellcolor[HTML]{EFEFEF}77.73±1.91 &
  \cellcolor[HTML]{EFEFEF}77.3±2.24 &
  \cellcolor[HTML]{EFEFEF}77.33±2.21 &
  \cellcolor[HTML]{EFEFEF}77.24±2.01 &
  \cellcolor[HTML]{EFEFEF}77.1±2.2 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  75.41±2.51 &
  76.09±2.38 &
  76.63±2.51 &
  \cellcolor[HTML]{9B9B9B}\textbf{76.79±2.48} &
  75.4±3.81 &
  68.27±6.82 &
  54.62±8.52 &
  44.11±5.36 &
  40.57±2.13 &
  39.9±2.09 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{PubMed}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}74.42±3.82 &
  \cellcolor[HTML]{EFEFEF}74.62±3.76 &
  \cellcolor[HTML]{EFEFEF}74.74±3.65 &
  \cellcolor[HTML]{9B9B9B}\textbf{74.75±3.44} &
  \cellcolor[HTML]{EFEFEF}74.68±3.45 &
  \cellcolor[HTML]{EFEFEF}74.48±3.44 &
  \cellcolor[HTML]{EFEFEF}74.28±3.38 &
  \cellcolor[HTML]{EFEFEF}73.92±3.36 &
  \cellcolor[HTML]{EFEFEF}73.73±3.36 &
  \cellcolor[HTML]{EFEFEF}73.42±3.44 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  65.06±2.74 &
  65.5±2.57 &
  66.51±2.49 &
  \cellcolor[HTML]{9B9B9B}\textbf{67.59±2.59} &
  67.45±2.56 &
  67.32±3.04 &
  65.53±3.58 &
  60.77±4.17 &
  51.41±4.86 &
  38.2±5.07 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{Citeseer}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}66.69±2.87 &
  \cellcolor[HTML]{EFEFEF}66.75±2.99 &
  \cellcolor[HTML]{EFEFEF}66.79±2.69 &
  \cellcolor[HTML]{EFEFEF}66.99±3.07 &
  \cellcolor[HTML]{EFEFEF}66.74±2.71 &
  \cellcolor[HTML]{EFEFEF}66.91±2.84 &
  \cellcolor[HTML]{9B9B9B}\textbf{67.02±3.03} &
  \cellcolor[HTML]{EFEFEF}66.98±2.92 &
  \cellcolor[HTML]{EFEFEF}66.93±2.92 &
  \cellcolor[HTML]{EFEFEF}66.96±2.97 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  \cellcolor[HTML]{9B9B9B}\textbf{93.02±0.27} &
  92.93±0.25 &
  92.9±0.21 &
  92.32±0.26 &
  88.66±0.39 &
  77.88±0.89 &
  52.34±1.93 &
  24.91±0.48 &
  22.63±0.29 &
  22.63±0.29 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{Coauthor CS}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}88.97±0.4 &
  \cellcolor[HTML]{EFEFEF}89.4±0.43 &
  \cellcolor[HTML]{EFEFEF}89.74±0.41 &
  \cellcolor[HTML]{EFEFEF}90.01±0.45 &
  \cellcolor[HTML]{EFEFEF}90.13±0.51 &
  \cellcolor[HTML]{EFEFEF}90.21±0.5 &
  \cellcolor[HTML]{9B9B9B}\textbf{90.29±0.5} &
  \cellcolor[HTML]{EFEFEF}90.26±0.47 &
  \cellcolor[HTML]{EFEFEF}90.3±0.44 &
  \cellcolor[HTML]{EFEFEF}90.24±0.42 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  58.77±18.34 &
  88.67±1.62 &
  \cellcolor[HTML]{9B9B9B}\textbf{90.09±1.02} &
  89.18±1.07 &
  83.23±1.64 &
  37.12±1.4 &
  33.17±0.93 &
  32±0.91 &
  29.05±0.94 &
  27.47±0.86 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{Amazon Photos}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}82.09±1.34 &
  \cellcolor[HTML]{EFEFEF}83.75±1.13 &
  \cellcolor[HTML]{EFEFEF}84.97±1.34 &
  \cellcolor[HTML]{EFEFEF}86.15±1.26 &
  \cellcolor[HTML]{EFEFEF}86.75±1.44 &
  \cellcolor[HTML]{EFEFEF}87.33±1.18 &
  \cellcolor[HTML]{EFEFEF}88.01±1.11 &
  \cellcolor[HTML]{EFEFEF}87.9±1.2 &
  \cellcolor[HTML]{9B9B9B}\textbf{88.05±0.99} &
  \cellcolor[HTML]{EFEFEF}87.46±1.3 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  27.06±0.61 &
  27.43±0.64 &
  27.74±0.61 &
  28.27±0.54 &
  28.58±0.57 &
  28.75±0.52 &
  \cellcolor[HTML]{9B9B9B}\textbf{28.9±0.56} &
  28.83±0.6 &
  28.51±0.67 &
  28.33±0.64 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{Actor}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}29.42±0.73 &
  \cellcolor[HTML]{EFEFEF}29.97±0.54 &
  \cellcolor[HTML]{EFEFEF}30.33±0.65 &
  \cellcolor[HTML]{EFEFEF}30.71±0.71 &
  \cellcolor[HTML]{EFEFEF}31±0.83 &
  \cellcolor[HTML]{EFEFEF}31.26±0.71 &
  \cellcolor[HTML]{EFEFEF}31.37±0.72 &
  \cellcolor[HTML]{EFEFEF}31.53±0.6 &
  \cellcolor[HTML]{EFEFEF}31.58±0.6 &
  \cellcolor[HTML]{9B9B9B}\textbf{31.63±0.6} \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  51.04±4.08 &
  51.01±3.25 &
  51.04±2.91 &
  51.28±3.08 &
  51.25±3.12 &
  52.29±3.06 &
  53.65±3.21 &
  54.38±4.17 &
  \cellcolor[HTML]{9B9B9B}\textbf{54.55±3.97} &
  54.55±4.65 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{Cornell}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}63.78±4.39 &
  \cellcolor[HTML]{EFEFEF}64.13±4.8 &
  \cellcolor[HTML]{EFEFEF}63.96±4.53 &
  \cellcolor[HTML]{EFEFEF}64.65±4.51 &
  \cellcolor[HTML]{EFEFEF}64.83±4.74 &
  \cellcolor[HTML]{EFEFEF}65.52±4.65 &
  \cellcolor[HTML]{EFEFEF}65.87±4.72 &
  \cellcolor[HTML]{EFEFEF}65.94±4.6 &
  \cellcolor[HTML]{9B9B9B}\textbf{66.46±4.5} &
  \cellcolor[HTML]{EFEFEF}66.39±4.61 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  48.03±3.69 &
  49.39±3.5 &
  50.29±3.04 &
  51.09±3.15 &
  \cellcolor[HTML]{9B9B9B}\textbf{51.79±3.1} &
  51.6±3.65 &
  50.8±4.59 &
  49.92±5.04 &
  49.07±5.16 &
  48.21±4.75 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{Wisconsin}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}50.24±4.88 &
  \cellcolor[HTML]{EFEFEF}51.04±5.38 &
  \cellcolor[HTML]{EFEFEF}51.28±5.77 &
  \cellcolor[HTML]{EFEFEF}52.03±6.07 &
  \cellcolor[HTML]{EFEFEF}52.91±6.32 &
  \cellcolor[HTML]{EFEFEF}53.6±5.82 &
  \cellcolor[HTML]{EFEFEF}54.4±6.01 &
  \cellcolor[HTML]{EFEFEF}55.28±6.06 &
  \cellcolor[HTML]{EFEFEF}55.76±6.01 &
  \cellcolor[HTML]{9B9B9B}\textbf{56.19±5.84} \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  \cellcolor[HTML]{F2F2F2}77.88±11.43 &
  \cellcolor[HTML]{F2F2F2}79.85±7.07 &
  \cellcolor[HTML]{F2F2F2}79.55±9.32 &
  \cellcolor[HTML]{F2F2F2}81.67±6.49 &
  \cellcolor[HTML]{F2F2F2}83.48±4.19 &
  \cellcolor[HTML]{9B9B9B}\textbf{84.39±3.03} &
  \cellcolor[HTML]{F2F2F2}84.09±2.88 &
  \cellcolor[HTML]{F2F2F2}84.09±2.88 &
  \cellcolor[HTML]{F2F2F2}84.09±2.88 &
  \cellcolor[HTML]{F2F2F2}84.09±2.88 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{PATTERN}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}79.7±7.77 &
  \cellcolor[HTML]{EFEFEF}81.36±5.44 &
  \cellcolor[HTML]{EFEFEF}81.67±5.37 &
  \cellcolor[HTML]{EFEFEF}83.03±3.56 &
  \cellcolor[HTML]{EFEFEF}83.48±2.98 &
  \cellcolor[HTML]{EFEFEF}83.79±3.28 &
  \cellcolor[HTML]{9B9B9B}\textbf{84.24±2.78} &
  \cellcolor[HTML]{EFEFEF}83.33±4.04 &
  \cellcolor[HTML]{EFEFEF}82.58±3.13 &
  \cellcolor[HTML]{EFEFEF}83.48±2.62 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  \cellcolor[HTML]{F2F2F2}36.14±15.88 &
  \cellcolor[HTML]{F2F2F2}34.29±10.24 &
  \cellcolor[HTML]{F2F2F2}39.86±12.84 &
  \cellcolor[HTML]{F2F2F2}32.71±11.16 &
  \cellcolor[HTML]{F2F2F2}37.29±14.56 &
  \cellcolor[HTML]{F2F2F2}38.14±12.09 &
  \cellcolor[HTML]{F2F2F2}38.29±14.69 &
  \cellcolor[HTML]{9B9B9B}\textbf{42.43±6.43} &
  \cellcolor[HTML]{F2F2F2}31.57±8.87 &
  \cellcolor[HTML]{F2F2F2}23.71±4 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{CLUSTER}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{9B9B9B}\textbf{45.29±11.74} &
  \cellcolor[HTML]{EFEFEF}37.71±11.39 &
  \cellcolor[HTML]{EFEFEF}41.43±9.52 &
  \cellcolor[HTML]{EFEFEF}37.71±9.69 &
  \cellcolor[HTML]{EFEFEF}38.14±11.19 &
  \cellcolor[HTML]{EFEFEF}37.86±10.91 &
  \cellcolor[HTML]{EFEFEF}32.43±10.01 &
  \cellcolor[HTML]{EFEFEF}33.29±9.92 &
  \cellcolor[HTML]{EFEFEF}33.29±12.25 &
  \cellcolor[HTML]{EFEFEF}28.14±11.61 \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  \cellcolor[HTML]{F2F2F2}26.38±3.18 &
  \cellcolor[HTML]{F2F2F2}30.83±3.21 &
  \cellcolor[HTML]{F2F2F2}51.85±5.45 &
  \cellcolor[HTML]{F2F2F2}64.36±3.96 &
  \cellcolor[HTML]{9B9B9B}\textbf{66.02±3.87} &
  \cellcolor[HTML]{F2F2F2}59.54±2.37 &
  \cellcolor[HTML]{F2F2F2}51.9±4.56 &
  \cellcolor[HTML]{F2F2F2}39.53±2.61 &
  \cellcolor[HTML]{F2F2F2}33.64±3.56 &
  \cellcolor[HTML]{F2F2F2}31.89±1.76 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{WikiCS}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{EFEFEF}52.14±4.28 &
  \cellcolor[HTML]{EFEFEF}58.89±2.14 &
  \cellcolor[HTML]{EFEFEF}60.69±2.51 &
  \cellcolor[HTML]{EFEFEF}62.56±5 &
  \cellcolor[HTML]{EFEFEF}64.33±2.81 &
  \cellcolor[HTML]{EFEFEF}66.68±3.59 &
  \cellcolor[HTML]{EFEFEF}68.91±2.15 &
  \cellcolor[HTML]{EFEFEF}69.01±2.19 &
  \cellcolor[HTML]{EFEFEF}68.08±3.37 &
  \cellcolor[HTML]{9B9B9B}\textbf{69.23±3.98} \\ \hline
\multicolumn{1}{|c|}{} &
  \textbf{Symmetric} &
  \cellcolor[HTML]{F2F2F2}16.12±0.05 &
  \cellcolor[HTML]{F2F2F2}16.4±0.39 &
  \cellcolor[HTML]{F2F2F2}17.14±1.03 &
  \cellcolor[HTML]{F2F2F2}16.85±1.7 &
  \cellcolor[HTML]{F2F2F2}16.14±0.03 &
  \cellcolor[HTML]{F2F2F2}16.44±0.2 &
  \cellcolor[HTML]{F2F2F2}16.95±0.08 &
  \cellcolor[HTML]{F2F2F2}17.18±0.11 &
  \cellcolor[HTML]{9B9B9B}\textbf{17.19±0.08} &
  \cellcolor[HTML]{F2F2F2}16.81±0.19 \\ \cline{2-12} 
\multicolumn{1}{|c|}{\multirow{-2}{*}{\textbf{OGBN-arxiv}}} &
  \cellcolor[HTML]{EFEFEF}\textbf{Row-Normalized} &
  \cellcolor[HTML]{9B9B9B}\textbf{17.35±0.2} &
  \cellcolor[HTML]{EFEFEF}17.07±0.33 &
  \cellcolor[HTML]{EFEFEF}16.9±0.08 &
  \cellcolor[HTML]{EFEFEF}16.69±0.12 &
  \cellcolor[HTML]{EFEFEF}16.57±0.12 &
  \cellcolor[HTML]{EFEFEF}16.36±0.27 &
  \cellcolor[HTML]{EFEFEF}16.18±0.1 &
  \cellcolor[HTML]{EFEFEF}16.14±0.03 &
  \cellcolor[HTML]{EFEFEF}16.14±0.03 &
  \cellcolor[HTML]{EFEFEF}16.14±0.03 \\ \hline
\end{tabular}
}
\caption{Results (accuracy) of the node classification task on 12 benchmark datasets. The number of experiments differs across datasets. Batch normalization has been applied to \textit{PATTERN}, \textit{CLUSTER}, and \textit{WikiCS}. Details for the experiments are provided in Table~3 in Appendix~E.}
\label{tab:experiments-combined}
\end{table*}


We emphasize here that the scope of this paper is not to suggest another convolution operator that would achieve state-of-the-art results. Rather, through this series of experiments, we hope to have convinced the reader that, empirically, the choice of convolution is important and can yield gains of up to almost 7\% in accuracy over traditional GNN approaches, with no modification to the architecture of the network whatsoever. Motivated by these observations, the rest of this paper focuses on analyzing these convolution operators and their impact on the organization of the embedding space.
%We attempt in particular to understand how these estimators relate to the geometry of the data in order to understand the phenomena observed in this first section.

\section{Geometry of GNN embeddings in latent space} \label{sec:geometry}
% !TEX root = main.tex
In this section, we analyze the effect of the convolution operator on the global organization of the latent space. Our objective is to (a) characterize the implicit constraints that these operators put on the geometry (in particular, that embeddings concentrate differently depending on the node degree and the choice of operator), and (b) identify the downstream consequences in terms of performance. This discussion is driven by considerations on the nodes' topological characteristics rather than by spectral arguments.
%\vspace{-0.3cm}
%In particular, in our discusson we will consider three types of main topologies: core-peripheries. Note that we choose not to characterise the effect of the convolution on Erdos-Renyi topologies, as they do not capture realistic structures.


% Consider the convolved node features:
% $$ \tilde{H} = SXW  + b_k$$

% The analysis of complex systems is in fact rather involved, and unfortunately, requires specifying the distribution of the node degree. However, there is no standard way of accounting for the similarity.

\subsection{Symmetric Convolutions} 

We begin our study of the ``absolute'' latent geometry of our embeddings with the family of symmetric convolutions $\mc{M}_{\alpha, \beta}$ (see Equation~\ref{eq:normalized}). 
For each layer $K$, the embedding of node $u$ is defined as:
%Contrary to many theoretical analysis, our study here is based on an analysis of the last (linear) layer of the Graph Neural Network (see Equation~\ref{eq:last_layer}), since it has the benefit of being a simple linear layer on the node embeddings (since introduction), denoted as:
\begin{small}
    \begin{align*}
    H^{(K)}_{u\cdot} = \big(S\s( H^{(K-1)}W+b)\big)_{u\cdot} = \sum_{v \in \tilde{\mc{N}}(u)} \frac{A_{uv}}{(d_u +\beta)^{\alpha}(d_v+\beta)^{\alpha}} Z_{v\cdot},
    \end{align*}
\end{small}
where $\tilde{\mc{N}}(u)= \mc{N}(u)\cup \{u\}$ denotes the extended neighborhood of node $u$, $A_{uv}$ is the $(u,v)$ entry of the (possibly weighted) adjacency matrix, whose diagonal entries are set to $\beta$, and $Z_{v\cdot}  = \s( H^{(K-1)}_{v\cdot}W+b)$.


As a first step toward studying the effect of the choice of convolution operator on the latent embedding space, we propose the following lemma.
\begin{lemma}\label{lemma:dis}
For any node $u$, the effect of the convolution can be characterized as follows:
\begin{equation}
\begin{split}
  ||(S Z)_{u\cdot}||_2 \leq { ||Z||_{2,\infty}}  \Big ( ( d_u + \b)^{1-2\alpha} -  \alpha  \frac{\bar{\Delta}_u}{(d_u + \beta)^{2\alpha}} \\+ \frac{\alpha(\alpha+1)M}{2}  \frac{\overline{\Delta^2}_u}{(d_u + \beta)^{1 + 2\alpha}} \Big) \label{eq:ineq3}
\end{split}
\end{equation}
where $\bar{\Delta}_u$ (respectively $\overline{\Delta^2}_u$) are the weighted averages of the degree differences (respectively, squared degree differences): 
\begin{footnotesize}
\begin{align*}
\bar{\Delta}_u = \frac{ \sum_{v \in \tilde{\mc{N}}(u)} A_{uv} (d_v -d_u)} {d_u + \beta} \text{, } \overline{\Delta^2}_u = \frac{ \sum_{v \in  \tilde{\mc{N}}(u)} A_{uv} (d_v -d_u)^2} {d_u + \beta}.
\end{align*}
\end{footnotesize}
In this equation, $||Z||_{2, \infty} = \max_{v} ||Z_{v\cdot}||_2$, and $M =\big(\frac{d_{\max} +\beta }{\beta +1}\big)^{2}$, where $d_{\max}$ denotes the maximal degree of the nodes in the network.
\end{lemma}

\begin{proof}
The proof is simple (see Appendix B), and relies on the triangle inequality coupled with a Maclaurin expansion of the function $d_v \mapsto \frac{1}{(d_u +\beta)^{\alpha}(d_v +\beta)^{\alpha}}$. 
\end{proof}
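
For intuition, the origin of the three terms in inequality~\eqref{eq:ineq3} can be sketched as follows. This is only an outline, written under two assumptions that are not spelled out above: that $\sum_{v \in \tilde{\mathcal{N}}(u)} A_{uv} = d_u+\beta$ (i.e., $d_u$ is the weighted degree), and that the second-order remainder factor of the expansion is bounded by the constant $M$ of the lemma. The complete argument is the one given in Appendix B. Writing $\Delta_v = d_v - d_u$ and $x_v = \Delta_v/(d_u+\beta)$, so that $d_v+\beta = (d_u+\beta)(1+x_v)$, the triangle inequality and a second-order expansion give
\begin{small}
\begin{align*}
\|(SZ)_{u\cdot}\|_2 &\leq \|Z\|_{2,\infty} \sum_{v \in \tilde{\mathcal{N}}(u)} \frac{A_{uv}}{(d_u+\beta)^{2\alpha}} (1+x_v)^{-\alpha}, \\
(1+x_v)^{-\alpha} &\leq 1 - \alpha x_v + \frac{\alpha(\alpha+1)}{2} M x_v^2 .
\end{align*}
\end{small}
Summing over $v$ and using the definitions of $\bar{\Delta}_u$ and $\overline{\Delta^2}_u$ then yields, term by term, $(d_u+\beta)^{1-2\alpha}$, $-\alpha \bar{\Delta}_u/(d_u+\beta)^{2\alpha}$, and $\frac{\alpha(\alpha+1)M}{2}\,\overline{\Delta^2}_u/(d_u+\beta)^{1+2\alpha}$.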

Note that this bound is not necessarily tight. In particular, the proof relies on an application of the triangle inequality, along with H{\"o}lder's inequality, to separate the convolution from the embeddings. However, while potentially crude, this bound already sheds light on the behavior of the embeddings as a function of the parameters $\a$ and $\b$ and of the graph topology. In particular, it highlights:
\begin{description}[noitemsep,leftmargin=0.5cm, topsep=0em]
\item[(a) The role of $\alpha$.] The leading term in inequality~\eqref{eq:ineq3} is $(d_u+\b)^{1-2\a}$, which offers a first explanation for the phase change that we observed in some of our experiments in the previous section. For values of $\alpha<0.5$, this term is an increasing function of the degree: after even a single convolution, nodes with a small degree have less leeway to spread and will generally remain close to the origin; high-degree nodes, on the other hand, will enjoy greater variance after each convolution and be able to spread to greater radii. Conversely, if $\alpha>0.5$, the upper bound decreases for nodes with high degree, forcing them to concentrate around the origin. The parameter $\a$ therefore controls the ``attraction'' of nodes towards the origin as a function of their degree.
\item[(b) The effect of $\beta$.] The coefficient $\beta$, on the other hand, acts as added mass to the degree $d_u$ and can be understood as the ``strength'' of the attraction: for $\a>0.5$, the attraction of high-degree nodes to the origin is an increasing function of $\b$. Conversely, for $\a<0.5$, the repulsion of the nodes from the origin is an increasing function of $\beta$. 
\item[(c) The influence of the surrounding topology.] As shown above, the node degree plays a defining role in the variance of the embedding. The bound also exhibits a dependency on the neighborhood topology through the terms $\bar{\Delta}_u$ and $\overline{\Delta^2}_u$. Therefore, the more topologically homogeneous the neighborhood, the smaller the variance. % Moreover, as shown in the proof, $\beta$ also controls the variable $M$: the higher the $\beta$, the lesser the impact of the neighborhood.
\end{description}
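
As a purely illustrative computation (the degrees and the value $\beta=1$ below are hypothetical choices, not taken from the experiments), consider a low-degree node with $d_u=3$ and a hub with $d_u=99$. The leading term $(d_u+\beta)^{1-2\alpha}$ then evaluates to:
\begin{small}
\begin{align*}
\alpha=0.2:&\quad 4^{0.6}\approx 2.30, & 100^{0.6}&\approx 15.85,\\
\alpha=0.5:&\quad 4^{0}=1, & 100^{0}&=1,\\
\alpha=0.7:&\quad 4^{-0.4}\approx 0.57, & 100^{-0.4}&\approx 0.16.
\end{align*}
\end{small}
In this example, the bound allows the hub a radius roughly seven times larger than that of the low-degree node for $\alpha=0.2$, and roughly 3.6 times smaller for $\alpha=0.7$, while the two leading terms coincide at $\alpha=0.5$.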
%\vspace{-0.1cm}

\xhdr{Consequences} The previous observations yield two main conclusions. First, the choice of the convolution operator drives the density of the embedding space: the values of $\beta$ and $\alpha$ allow the embedding space to expand or contract around the origin, depending on the node degree.\\ %By induction, in a similar fashion to \cite{oono2019graph}, it is possible to show that consecutive convolutions followed by 1-Lipschitz (e.g. ReLU) non-linearities contract even further the embedding towards the origin.\\
The second consequence pertains to the accuracy of the recovery. Since the last layer of a GNN is a linear classifier, we can use known results from statistical theory about the influence of the different points on the performance of the algorithm. In particular, in linear regression, it is known that high-leverage points (that is, points with ``extreme'' predictor values) are more likely to be highly influential \cite{weisberg2005applied} (the same holds for generalized linear models, with some nuances). As such, by preventing high-degree (respectively, low-degree) nodes from taking on extreme embedding values and concentrating them around the origin, the convolution operator implicitly limits the amount of trust, or leverage, that these points may have. We summarize these observations in the following lemma.

\begin{lemma}[Effect of symmetric convolutions on node embeddings]
In networks with non-homogeneous degree distributions, the exponent $\alpha$ constrains the leverage associated with each of the embeddings as a function of their degree and local topology:
\begin{itemize}[noitemsep, topsep=0em, leftmargin=0.5cm]
    \item{\bf For values of $\a>0.5$:} High-degree nodes are constrained to concentrate around the origin, allowing the performance of the algorithm to be driven by more peripheral (low-degree) nodes.
    \item{\bf For values of $\a< 0.5$:} Low-degree nodes need to lie closer to the origin than their high-degree counterparts, thereby allowing high-degree nodes to have higher leverage and potentially become more influential.
\end{itemize}
\end{lemma}

% \xhdr{Experiments} To 
% The study of the lower bound is a little bit more complex. Suppose that the network is homophilic, so that $H_{v}^{(k)}W = H_{u}^{(k)}W + \tilde{\Delta}_u $  for all $v$ in the neighborhood of $u$. Here, $\tilde{\Delta}_v$ denotes a perturbation of $H_{u}^{(k)}W$ for which (as a by-product of the homophily assumption), we can assume that $||\tilde{\Delta}_v||_2/|| H_{u}^{(k)}W||_2$ is small. Then, coordinate wise, assuming that $H_{u}^{(k)}W$ is nonnegative (the converse will follow by flipping the sign of the following inequality), we can see that:
% $$H_{u a}^{(k)} - b^{(k)}_a = \frac{1}{(d_u + \b)^{2\a} } \sum_{v \in \mathcal{N}(u)\cup\{u\}} \frac{A_{uv} ( H_{u a}^{(k-1)}W + \tilde{\Delta}_v)}{( d_v + \beta )^{\a}}   \geq\frac{(d_u + \b)^{1-2\a} }{(d_{\max} + \b)^{\a}} +o(1)   $$
% Therefore, this proves that, as $\alpha$ increases, the embeddings are progressively placed in on the line defined by the different projection directions.
\xhdr{Experiments} To illustrate these bounds and check their validity, we perform a set of synthetic experiments. We generate a set of four cliques on 20 nodes (the ``hubs''). To each node in each clique, we add a link to a sparse Barab\'asi-Albert network on 10 nodes (the ``periphery''), with parameter $m=1$, so that the average degree of the periphery is low ($\approx 1$). The peripheries are endowed with the same class label as their associated hub. Finally, we ensure that the network is connected by randomly connecting the hubs together (one new random link per hub). To generate node features, we take the first $k=4$ features to be the one-hot label vector, to which we concatenate 16 additional ``dummy features'' (random Gaussians). We finally add Gaussian noise with scale $\s^2=4$ entrywise: the result is a feature vector that is only weakly indicative of the class. The trained embeddings are presented in Figure~\ref{fig:exp_part2}. Note the lack of separability of the different classes based on their raw feature vectors, as captured by the PCA plot in the left column. As expected from our bounds, we observe an inversion of the geometry around the origin as $\a$ increases: high-degree nodes shift from the outskirts of the plot to being concentrated around the origin. We also refer the reader to Appendix~E.3.2, in which we verify these phenomena on standard datasets by providing PCA and UMAP plots of the corresponding node embeddings.


\begin{figure*}[h]
    \centering
    \includegraphics[width=0.9\textwidth]{NEW_FIGURES/hubs_and_spokes_panel_norm.002.png}
    \caption{Symmetric embeddings, plotted using the first two principal components (left), and the raw latent embeddings (or ``latent components'' (LC), shown in the right three plots on each row). Note the inversion: the high-degree nodes migrate from the periphery of the latent space ($\a=0.2$) to the origin ($\a=0.7$). See Appendix~E for the equivalent plot for row-normalized embeddings.}
    \label{fig:exp_part2}
\end{figure*}

In order to test Lemma~\ref{lemma:dis}, we modify this experiment slightly, and now let the variance of the noise depend on the degree $d_u$ of the node: $\s^2_u=e^{ 3 (-1.5 + \log(d_u))}$. This means that low-degree nodes have very small noise ($\s<1$), while high-degree nodes are extremely noisy ($\s \approx 9$). Consequently, we expect the geometries in which the high-degree nodes are placed on the outskirts (and have more leverage) and low-degree nodes are constrained to lie close to the origin to perform poorly. We then flip this scenario (we choose $\s^2_u=e^{ 3 (-1.5 + \log(d_{(n-\mathrm{rk}(u))}))}$ where, if $d_u$ has rank $\mathrm{rk}(u)$, $d_{(n-\mathrm{rk}(u))}$ is the degree of the node with the $(n-\mathrm{rk}(u))$-th largest degree), and expect the opposite phenomenon. The results are shown in Figure~\ref{fig:exp_part2a}(a), and are well aligned with our expectations.
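
To make the orders of magnitude concrete, the noise model can be rewritten as a power law in the degree. The computation below assumes that $\log$ denotes the natural logarithm, and takes $d_u=1$ and $d_u=20$ as representative degrees for a peripheral leaf and a clique node (illustrative values only):
\begin{small}
\begin{align*}
\sigma_u^2 &= e^{3(-1.5+\log d_u)} = e^{-4.5}\, d_u^{3},\\
d_u = 1:&\quad \sigma_u^2 \approx 0.011, \quad \sigma_u \approx 0.11,\\
d_u = 20:&\quad \sigma_u^2 \approx 88.9, \quad \sigma_u \approx 9.4,
\end{align*}
\end{small}
which is consistent with the regimes $\sigma<1$ and $\sigma\approx 9$ described for low-degree and high-degree nodes.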
% Similarly, for a GNN with reLU layers:
%  \begin{align*}
% || M_{\alpha, \beta} H^{(k-1)}W||_{2} &\geq \frac{1}{(d_u + \b)^{2\a} } \sum_{v \in \mathcal{N}(u)} \frac{1}{( 1 +  \frac{\Delta_v}{d_u + \beta} )^{\a}} \min ||H^{(k-1)}W||_{2}\\
% &\geq \frac{ d \min || H^{(k-1)}W||}{(d_u + \b)^{2\a} } \sum_{v \in \mathcal{N}(u)}  ( 1 -\alpha  \frac{\Delta_v}{d_u + \beta} + \frac{\alpha(\alpha+1)}{2}  \frac{\Delta^2_v}{(d_u + \beta)^2} M) \\
% &\leq { || H^{(k-1)}W||_{2,\infty}}  \Big ( ( d_u + \b)^{1-2\alpha} -  \alpha  \frac{\bar{\Delta}_v}{(d_u + \beta)^{2\alpha}} + \frac{\alpha(\alpha+1)}{2}  \frac{\overline{\Delta^2}_v}{(d_u + \beta)^{2\alpha}} M\Big)\\
% \end{align*}
 %\end{description}
% To get a lower bound, one would need to have a model for the amount of homophily in the network.


% We also need to prove that this holds for multiple levels. This holds because we have a contraction.

\subsection{Row-normalized Convolutions}
%\vspace{-0.2cm}

Let us now turn to the case of row-normalized convolutions. In this case, the convolution can be written as:
\begin{equation*}
    \begin{split}
    (SX)_{u\cdot} &=\frac{\frac{1}{(d_u+\b)^{\a}} \sum_{v \in \mc{N}(u) \cup\{u\}} \frac{1}{(d_v+\b)^{\a}}X_{v}}{\sum_{v \in \mc{N}(u) \cup\{u\}} \frac{1}{(d_u+\b)^{\a}}  \frac{1}{(d_v+\b)^{\a}} } \\
    &=\frac{1}{\sum_{v \in \mc{N}(u) \cup\{u\}} \frac{1}{(d_v+\b)^{\a}} } \sum_{v \in \mc{N}(u) \cup\{u\}} \frac{1}{(d_v+\b)^{\a}}X_{v}.
    \end{split}
\end{equation*}

The embedding now lies within the convex hull of its neighbors, whose contributions decrease with their degrees. This reduces the sensitivity of the variance of the node embeddings to their degree. The decay of a neighbor's contribution is in fact an increasing function of $\a$. This can be compared to a form of attention that effectively filters out high-degree nodes: here, the discounting procedure is not learned but imposed ahead of time. Appendix~E and Figure~\ref{fig:exp_part2a}(b) show the results of the same experiments as in the last subsection. As in the previous part, the results are less dependent on the value of $\a$.
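
As a small worked example of this discounting (hypothetical values; we assume here that the normalizing sum runs over the same set as the weighted sum, so that the weights add up to one), consider a sum with only two terms, a leaf with $d_1=1$ and a hub with $d_2=99$, and take $\beta=1$. The normalized weights $w_j = (d_j+\beta)^{-\alpha}/\sum_{k}(d_k+\beta)^{-\alpha}$ are then:
\begin{small}
\begin{align*}
\alpha=0:&\quad w_1 = 0.5, & w_2 &= 0.5,\\
\alpha=0.5:&\quad w_1 \approx 0.876, & w_2 &\approx 0.124,\\
\alpha=1:&\quad w_1 \approx 0.980, & w_2 &\approx 0.020.
\end{align*}
\end{small}
The hub's contribution is thus progressively filtered out as $\alpha$ grows, while the weights remain a convex combination for every value of $\alpha$.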

\begin{figure*}[h]
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{NEW_FIGURES/plothubs_spokes_varying_noise.png}
\caption{Results for symmetric convolutions.}
\end{subfigure}
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{NEW_FIGURES/plothubs_spokes_varying_noise_diffusion.png}
\caption{Results for row-normalized convolutions.}
\end{subfigure}
\caption{Results for 50 independent trials of our second experiment using symmetric (left) and row-normalized (right) convolutions. Note that the accuracy of the symmetric convolutions increases as $\a$ increases when the noise is higher in high-degree nodes, and decreases in the opposite scenario.}
\label{fig:exp_part2a}
\end{figure*}




\section{Inherent embedding geometry} \label{sec:geometry-intrinsic}

We now turn to the analysis of the evolution of the relative distances between embedding points. % as a function of the convolution's parameters $\a$ and $\beta$ in $\mc{M}_{\a, \b}$ and $\mc{F}_{\a, \b}$, the nodes' local topology along with their feature vectors. 
Graph Neural Networks can indeed be understood as a data integration method: the adjacency matrix and the node features provide two complementary views of the data, and GNNs provide a convenient way of combining this information to create informative embeddings. 
%However, the rest of their analysis is driven by considerations on the spectrum of this kernel --- therefore making it difficult to relate it back to local properties of the graph\cite{balcilar2020analyzing}. 
The extent to which these embeddings depend on each side of the data (i.e., graph vs.\ node features) remains to be determined.


% \xhdr{Embedding structural features} The two families of convolution operators differ radically in their encoding of topological features. To illustrate this, consider a simple two-layer GCN such as the one suggested by Kipf et al. \cite{kipf2016semi}.  In this case, for the directions in which the term is positive,  the embedding $H^{(K)}$ (i.e. the transformed features fed into the last linear layer) can be written as:

% \vspace{-0.35cm}

% \begin{align*} 
% \resizebox{0.49\textwidth}{!}{
% H_{u\cdot}= \sum_{k=1}^d \sum_{\substack{
%         v \in \tilde{\mc{N}}(u)  \\
%         (S XW + b)_{vk} \geq 0
%     }} \big( S_{uv}(S XW)_{vk}W^{(2)}_{k\cdot}+ S_{uv}b_{vk}W^{(2)}_{k\cdot} \big)
%     }
% \end{align*}
% \vspace{-0.35cm}

%     The embedding is thus the sum of two components: a function of a (subset of) neighboring node features and a term on the right that is related to  the local topology. %To see why this is the case, consider a scenario where nodes in $\mc{N}(u)$ are all such that $ (S XW + b)_{vk} \geq 0$ for all $k$ or $ (S XW + b)_{vk} < 0$ for all $k$. Denote $\tilde{A}(u) = \{ v \in \tilde{N}(u):\quad (S XW + b)_{vk} \geq 0 \quad \text{for all } k \}$. 
%     %In this case, the previous equation becomes:
% % $H_{u\cdot}= \sum_{
% %         v \in {\tilde{A}(u)}} \big( S_{uv}(S X\tilde{W})_{v\cdot} + S_{uv}\tilde{b}_{v\cdot} \big)$, with $\tilde{b} = bW^{(2)}$ and $\tilde{W}=WW^{(2)}$.
 
% Using this notation, it becomes clear that:

% \begin{itemize}[noitemsep, topsep=-0.8em, leftmargin=0.5cm]
%     \item {\bf For row-normalized embeddings}, for two nodes to be close, it is sufficient for their neighborhood to have similar features. the distance between two features can be related to an ANOVA test between two neighborhoods, and is devoid of topological information:
%     \begin{small}
%     \begin{align*}
%        ||H_{u\cdot} - H_{u'\cdot}||^2 = \sum_{j=1}^p \Big(  \overline{(S X\tilde{W})_{\cdot j}}_{\tilde{A}(u)} -  \overline{(S X\tilde{W})_{\cdot j}}_{\tilde{A}(u')} \Big)^2, 
%     \end{align*}
%     \end{small}
%     where $\overline{X}_{\tilde{A}(u)}$ denote the mean of $X$ over the set $\tilde{A}(u)$. Consequently, these  embeddings are appropriate when presumably, topology does not carry information relevant to the prediction.
%     \item {\bf For symmetric embeddings}, on the other hand, the embedding depends strongly on the topology:
%     \begin{small}
%     \begin{align*}
%         H_{uk}= \sum_{
%         v \in {\tilde{A}(u)}} \Big(\sum_{
%         w \in \mc{N}(v)} \frac{(X\tilde{W})_{wk} }{(d_u + \beta)^{\a}(d_v + \beta)^{2\a} (d_w + \beta)^{\a}}\Big) \\
%         + \sum_{v \in {\tilde{A}(u)}} \frac{ \tilde{b}_{k}}{(d_u + \beta)^{\a}(d_v + \beta)^{\a}}
%     \end{align*}
%     \end{small}

% In this case, note the existence of a bias term that is a function of both the source node's degree and that of its neighbours: this bias term can be seen as an offset that differentiates between high-degree nodes and low-degree nodes.
% \end{itemize}

 Let us study the effect of the different distances (features and topology) through two toy examples, controlling for feature similarity and topological similarity separately.

\xhdr{Toy example 1: Identical Topologies, Different features} 
Consider two nodes $u$ and $u'$ that have structurally identical neighborhoods (i.e., there exists a mapping $\phi$ that maps each node in the neighborhood of $u'$ to its counterpart in the neighborhood of $u$; see Appendix C, Figure~1), but whose feature vectors are different. Mathematically, we write:
\begin{align*}
\forall j \in \mc{N}(u'), \quad  X_{j\cdot} = X_{\phi(j)\cdot} +\epsilon_j, \qquad \epsilon_{jk} \overset{\text{i.i.d.}}{\sim}N(0,\s^2).  \end{align*}
%where $\epsilon$ is a vector with independent sub-Gaussian entries with parameter $\sigma$.

\begin{lemma} \label{lemma:inh}
    For symmetric convolutions, with probability at least $1-\delta$, with $M$ as in Lemma~\ref{lemma:dis}, we have: 
    \begin{align*}
   & ||H_u -H_{u'}||^2  \leq \mu +  2 \sqrt{2}\s||W||_2(d_u +\beta)^{1-2\a} \\
    &\times \sqrt{1 +  {2\a |\overline{\Delta}_u |} +  \a(2\a+1)\frac{M \overline{\Delta^2}_u}{d_u}} \log(1/\delta) 
    \end{align*}
      where $\mu = \sigma^2||W||^2 (d_u +\beta)^{2-4\a} 
 \Big( 1 +  2\a |\overline{\Delta}_u | +  \a(2\a+1) M \frac{\overline{\Delta^2}_u}{d_u}\Big)$.
      
    Conversely, for row-normalized embeddings:
    %\vspace{-0.2cm}
    \begin{align*}
    ||H_u -H_{u'}||^2 \leq & \mu + 2 \sqrt{2}\s||W||_2 \\
    &+ \Big(\sum_{v \in \tilde{\mc{N}}(u)} \frac{1}{(d_v +\beta)^{2\a}}\Big)^{1/2}\log(1/\delta)
    \end{align*}
    where  $\mu = \frac{\sigma^2\|W\|^2}{\sum_{v \in \tilde{\mc{N}}(u)} (d_v +\beta)^{-\a}} \frac{1}{1+\beta}\leq \beta \sigma^2\|W\|^2$  
    for $\beta\geq 1.$
\end{lemma}
The proof is in Appendix C. Symmetric embeddings will thus shrink distances between nodes depending on their topology, so that the cluster density is a function of the node degree. The impact of the topology is expected to be more marked at the extremities of the spectrum of $\a$. For row-normalized embeddings however, the degree of the node does not affect the distance between embeddings as much: row-normalized embeddings encode attributes, rather than topological information.

\xhdr{Toy example 2: Identical Features, structurally different neighborhoods} 
Conversely, suppose that $u$ and $u'$ have radically different neighborhoods from a topological perspective, but have similar features:
\[\forall j \in \tilde{\mc{N}}(u)\cup \tilde{\mc{N}}(u'), \quad X_j = \bar{X} +\epsilon_j. \]
In this case, for row-normalized embeddings, the difference will be 0: the embeddings are therefore more sensitive to the node feature values than the symmetric convolution.
Conversely, for symmetric convolutions,  we can also show (see Appendix C) that the leading term for the difference is of the order of  $(d_u+\beta)^{1-2\a} - (d_{u'}+\beta)^{1-2\a}$ --- and is, therefore, a decreasing function of $\a$.
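
For a concrete sense of scale (hypothetical degrees, with $\beta=1$), take $d_u=99$ and $d_{u'}=3$. The leading term $(d_u+\beta)^{1-2\alpha}-(d_{u'}+\beta)^{1-2\alpha}$ then equals
\begin{align*}
\alpha=0:&\quad 100-4=96,\\
\alpha=0.25:&\quad 100^{0.5}-4^{0.5}=8,\\
\alpha=0.5:&\quad 1-1=0.
\end{align*}
For $\alpha>0.5$, both terms are smaller than 1 in this example, so the gap remains below 1 in absolute value.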

\xhdr{Experiments} We illustrate these results by running a final set of experiments. We generate structurally equivalent networks (see the smaller replica in Figure~1a in Appendix C), and evaluate the distance between untrained embeddings (using a 2-layer GCN architecture). The mean distance over 100 experiments is presented in Appendix C, Figure~1b, and a visualization of the latent space for symmetric convolutions is shown in Appendix C, Figure~1c. As expected, the density of the cluster of structurally equivalent high-degree nodes (cliques on 40 nodes), shown in grey, varies as a function of $\a$. In Appendix~E.3.3, we also provide a visualization of the interplay between node features and topologies on a subset of real datasets, using the Gromov-Wasserstein distance to measure the distance between the embedding space and the original characteristics of the dataset (features and adjacency matrix).



\xhdr{Toy Example 3: Degree-corrected Stochastic Block Model} We conclude this section by considering a specific family of graphs: the degree-corrected Stochastic Block Model (DC-SBM) \cite{karrer2011stochastic} on two classes of equal size $n$. Let each node have class $Z_i \in \{1, 2\}$, and denote by
 $X_i = \mu^{(Z_i)} + \epsilon_i$ its attributes. According to the DC-SBM, each edge in the network is sampled according to a Bernoulli distribution:
$ A_{ij}  \sim \text{Bernoulli}(\theta_i \theta_j \omega_{Z_iZ_j}),$ where $\theta_i$ is a popularity parameter such that, for each group $g$,
$ \sum_{i=1}^n \theta_{i}1_{Z_i=g} = n,$
and where $\omega_{gh}$ is the parameter of the model corresponding to the probability of connection between groups $g$ and $h$. Note that, under this model, the expected number of edges from community $g$ to community $h$ is simply $m_{gh} =n^2 \omega_{gh}$. Therefore, picking $\theta_i=1$ for all $i$ corresponds to the traditional stochastic block model. 

In this case, it is possible to show (see Appendix D) that the mean of the symmetric embedding of node $i$ is proportional to $\theta_i^{1-\alpha}$, where $\theta_i$ is its popularity parameter.
Consequently, for $\alpha=1$, the leading term is independent of $\theta_i$. Conversely, for $\alpha=0$, the embedding is directly proportional to $\theta_i$. This means that the embeddings will on average have a norm that is proportional to their popularity: the embedding space thus captures the ``popularity'' of the nodes through their degree.
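
As an illustrative computation (the values of $\theta$ are hypothetical, and we assume that the proportionality constant is the same for both nodes, e.g., because they belong to the same class), compare a popular node with $\theta_i=4$ to an unpopular one with $\theta_j=0.25$. The ratio of the leading terms of their mean embeddings is $(\theta_i/\theta_j)^{1-\alpha}=16^{1-\alpha}$, that is:
\[ \alpha=0:\ 16, \qquad \alpha=0.5:\ 4, \qquad \alpha=1:\ 1. \]
The popularity gap is therefore fully expressed in the embedding norms for $\alpha=0$, dampened for intermediate values, and removed from the leading term for $\alpha=1$.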

To see this, we provide the following example. Consider a DC-SBM graph on 300 nodes with two classes, with connectivity parameters $\omega_{11}=\omega_{22}=0.1$ and $\omega_{12}=\omega_{21}=0.005$. The features here are taken to be multivariate normal with $\mu^{(1)}=2$, $\mu^{(2)}=-2$ and standard deviation equal to 4. We generate the $\theta_i$ for each group from a lognormal distribution, with mean 0 and standard deviation 1 (see details in Appendix D). In the results (Figure~\ref{fig:emb} and Appendix D), we observe, as predicted, the high dependency of the embedding on the ``popularity'' parameter for low values of $\alpha$.


\begin{figure}
         \centering
         \includegraphics[width=\textwidth]{FIGS/embs_new.png}
         \caption{Embeddings after one convolution for two different values of $\alpha\in \{0,1\}$, and $\beta=1.$ Note how the value of the popularity parameter $\theta$ drives the geometry of the embedding space when $\alpha=0.$}
         \label{fig:emb}
\end{figure}


%\vspace{-0.3cm}
\section{Discussion} \label{sec:properties}
To summarize, this paper has sought to make explicit two main phenomena: (a) the dependency of the variance of the embeddings on the degree, the choice of convolution operator, and the parameter $\alpha$; (b) the higher sensitivity of the symmetric embedding distances to topological features compared to that of the row-normalized ones.
We now conclude by discussing the practical impact of these observations. Consider the two following use cases:
 \begin{description}[noitemsep, topsep=0em, leftmargin=1em]
 \item[(a) Learning user embeddings in a social network,] as in, for instance, \cite{zitnik2018modeling}. Here, the degree can be considered as an additional dimension of information: users with high degree might be more popular or sociable, and therefore more alike to one another. This information should be reflected in the embedding space. In this first case, a symmetric embedding, which is typically more sensitive to distances between topological features, might be more suitable to the task. Moreover, based on our results on the degree-corrected stochastic block model, the lower the chosen $\alpha$, the higher the potential emphasis on the degree.
 \item[(b) Learning drug embeddings in a biological network] (e.g., \cite{zitnik2018modeling}). In this case, the degree of the node might not necessarily be as informative: some drugs may have been on the market for longer, and/or their mechanisms of action may be better understood. In that case, the features of a drug's neighbors might be informative, but not necessarily their degree. Here, a row-normalized embedding might prove a better choice. Alternatively, the symmetric convolutions with $\alpha=1$ would mitigate the effect of the sampling density. We note however that the rapid contraction of the embeddings towards 0 makes it difficult for the GNN to learn informative embeddings.
 \end{description}


%\vspace{-0.3cm}
\section{Conclusion} \label{sec:conclusion}
In conclusion, we have shown in this paper that the choice of the convolution operator has fundamental consequences on the geometry of the embedding space: symmetric convolutions are generally more sensitive to the topology, and encode it in the embedding. In that case, the choice of $\a$ amounts to selecting ``who to trust'': high values of $\a$ push high-degree nodes towards the origin, thereby limiting their leverage. Conversely, row-normalized convolutions are more limited in the amount of topological information that they carry and are more robust to the choice of $\a$; they are probably a better choice when the data is assumed to be sampled from a manifold (e.g., point cloud data).
Our analysis, which we hope is insightful, has room for further improvement. Our reasoning relies on upper bounds which, while providing intuition, are not extremely tight, and could be complemented with lower bounds to fully characterize the behavior of the geometry. All experiments use GCN-type architectures. However, we believe that the intuition and guidelines that we derived from this analysis will nonetheless hold for other types of architectures.


\setcounter{tocdepth}{10}

%\begin{contributions} % will be removed in pdf for initial submission 
					  % (without ‘accepted’ option in \documentclass)
                      % so you can already fill it to test with the
                      % ‘accepted’ class option
%    Briefly list author contributions. 
%    This is a nice way of making clear who did what and to give proper credit.
%    This section is optional.

%\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
The authors are grateful for the support of the University of Chicago’s Research Computing Center for assistance with the calculations carried out in this work, as well as for a generous award from Facebook Research (2021 Proposal for Statistics for Improving Insights, Models, and Decisions) that supported this research.
\end{acknowledgements}


% References
\bibliography{references}
\end{document}
