\documentclass[accepted]{uai2022} 
%% In your camera-ready you should use the 'accepted' parameter. This shows the authors and how an accepted paper will look like. The footer is 'Acccepted for X'. In the final version, the proceedings chairs will add the page numbers for PMLR and the final footer will be 'Proceedings of X'.
%
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

\usepackage[american]{babel}

\usepackage{times} %
\usepackage{helvet} %
\usepackage{courier} %
\usepackage{graphicx} %
\usepackage{natbib}  %
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{caption}  %
\DeclareCaptionStyle{ruled}%
 {labelfont=normalfont,labelsep=colon,strut=off}

\usepackage{soul}
\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{mathrsfs}
\usepackage{cleveref}
\usepackage{booktabs}
\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{bm}
\usepackage{bbm}
\usepackage{subfig}
\usepackage{amssymb}
\usepackage{enumitem}
\usepackage{tikz}
\usepackage{float}
\urlstyle{same}





\newenvironment{proofsketch}{%
  \renewcommand{\proofname}{Proof Sketch}\proof}{\endproof}
\newcommand{\indep}{\rotatebox[origin=c]{90}{$\models$}}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}

\theoremstyle{dfn}
\newtheorem{dfn}{Definition}

\theoremstyle{proposition}
\newtheorem{proposition}{Proposition}

\newtheorem{theorem}{Theorem}
\newtheorem{corrollary}{Corollary}

\newtheorem{assumption}{A}

\newcommand{\commentout}[1]{%
}

\newcommand{\Cov}{\mathrm{Cov}}

\newcommand{\fixme}[1]{{\textcolor{red}{\textit{#1}}}}
\newcommand{\red}[1]{{\textcolor{red}{\textit{#1}}}}
\newcommand{\green}[1]{{\textcolor{green}{\textit{#1}}}}
\newcommand{\blue}[1]{{\textcolor{blue}{\textit{#1}}}}

\newcommand{\david}[1]{{\textcolor{brown}{David: \textit{#1}}}}

\newcommand{\relx}{\sigma(v_i, \mathbf{x}, G)}

\newcommand{\ba}{Barab\'asi-Albert}
\newcommand{\er}{Erd\H{o}s-R\'{e}nyi}
\newcommand{\ws}{Watts-Strogatz}
\newcommand{\mcit}{NIRD}
\newcommand{\mcita}{NIRD-A}
\newcommand{\rel}[1]{\sigma_{#1}(v_i)}
\newcommand{\relv}[1]{\sigma_{#1}^{v_i}}
\newcommand{\relt}[1]{{\sigma(v_i, {#1}_t, G)}}



\begin{document}

\title{Non-Parametric Inference of Relational Dependence (Supplementary File)}
\author[1]{Ragib Ahsan}
\author[1]{Zahra Fatemi}
\author[2]{David Arbour}
\author[1]{Elena Zheleva}
\affil[1]{%
    Department of Computer Science\\
    University of Illinois at Chicago\\
    Chicago, IL, USA
}
\affil[2]{%
    Adobe Research, USA
}

\onecolumn
\maketitle


% \begin{abstract}
% \end{abstract}

\appendix


\section{Proofs}
In this section we present the proofs of consistency for HSIC and relational HSIC under weak dependence. 
The approach is to extend the results of \citet{chwialkowski-nips14} and \citet{leucht-jma13}, who analyze degenerate $U$- and $V$-statistics (which include HSIC as a specific instantiation) under weak dependence in spaces that admit Euclidean distances, to the more general setting of graph-structured spaces.
Most of the results carry through after modifications that accommodate the fact that the number of reachable instances at a specific distance is irregular. We present a modification of the relevant proof, which shows the convergence in distribution of degenerate $V$-statistics and may be of independent interest, and then describe its application to our setting and the extension to relational variables. 

\subsection{$V$-Statistics Under Relational Weak Dependence} 
Let $X = \{X_1,\dots, X_n\}$ be the set of given observations. 
Define $h$ to be a symmetric function, taking $m$ arguments. 
A $V$-statistic is a function defined with respect to $h$ taking the form

\begin{align*}
    V_n(h, X) = \frac{1}{n^m}\sum_{(i_1, \dots, i_m) \in N^m} h(X_{i_1}, \dots, X_{i_m})
\end{align*}

where $N^m$ is the $m$-fold Cartesian product of the set $N = \{1,\ldots,n\}$ and $n$ is the total number of observations. 
In the sequel, we will write $V_n(h,X)$ as $V_n$ to reduce notational clutter. 
We will refer to $h$ as the \textit{core}\footnote{In order to prevent confusion, we follow \citet{chwialkowski-nips14} and do not follow the canonical convention of calling $h$ the kernel.}. 
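
As a concrete illustration (a sketch under the definitions above, not an additional result), for a core with $m = 2$ arguments the statistic reduces to a double sum over all ordered pairs of observations,
\begin{align*}
    \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n h(X_i, X_j),
\end{align*}
which, unlike the corresponding $U$-statistic, includes the $n$ diagonal terms $h(X_i, X_i)$.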

We say that a core $h$ is \emph{$j$-degenerate} if for every $x_1,\dots,x_j$, 
\begin{align*}
E[h(x_1,\dots,x_j,X^*_{j+1},\dots,X^*_m)] = 0
\end{align*}
where
$X^*_{j+1},\dots,X^*_m$ are independent samples drawn from the same distribution as $X_1$.
A core is called canonical if for all $j \leq m - 1$ it is $j$-degenerate. 
Finally, we call a $V$-statistic with a 1-degenerate core a \emph{degenerate $V$-statistic}.
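
For intuition, consider the case $m = 2$. The following sketch uses notation that is not part of the definitions above: a kernel $k$ with finite expectations and independent copies $X^*, X^{**}$ drawn from the same distribution as $X_1$. Centering $k$ yields a core that is 1-degenerate:
\begin{align*}
    \tilde{h}(x, y) &= k(x, y) - E[k(x, X^*)] - E[k(X^*, y)] + E[k(X^*, X^{**})],\\
    E[\tilde{h}(x, X^{**})] &= E[k(x, X^{**})] - E[k(x, X^*)] - E[k(X^*, X^{**})] + E[k(X^*, X^{**})] = 0 \quad \text{for every } x.
\end{align*}
Since $m - 1 = 1$ here, such a core is also canonical.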

We now provide a proof of consistency of degenerate $V$-statistics for relational data under weak dependence. 
The strategy of this proof is to first approximate the $V$-statistic with weighted sums of squares, and then apply the central limit theorem to this approximation. 
The approximation used is the spectral decomposition of the core
\begin{align*}
 h(x, y) = \sum_k\lambda_k\Phi_k(x)\Phi_k(y)
\end{align*}
where $\lambda_k$ are the nonzero eigenvalues of $E[h(x, X_0)\Phi(X_0)] = \lambda\Phi(x)$, and $\Phi_k$ are the associated eigenfunctions.
This strategy largely mirrors that of \citet{leucht-jma13}.
However, in that case, the approximations are constructed as a function of distance in time. 
Our contribution is a generalization of the approximations to network domains that follow the aforementioned assumptions.
This is done by considering \textit{sets} of instances separated by a shortest-path distance of $r$, rather than assuming that there is always a single instance at distance $r$, and adapting the results accordingly.

\setcounter{theorem}{1}
\begin{theorem}
\label{thm:weakv}
Let $(Z_k)_k$ be centered, jointly normal random variables with $\Cov(Z_j, Z_k) = \sum_{r=-\infty}^\infty\Cov(\Phi_j(X_0), \Phi_k(X_r))$, and $(\lambda_k)_k, (\Phi_k)_k$ be the sequence of non-zero eigenvalues and corresponding eigenfunctions of 
$E\left[h(x, X_0)\Phi(X_0)\right] = \lambda\Phi(x)$.
Under the aforementioned assumptions, $V_n \overset{d}{\longrightarrow} Z := \sum_k \lambda_k Z^2_k$, as $n \rightarrow \infty$, and 
$EZ = \sum_{r\in\mathbb{Z}}Eh(X_0,X_r) < \infty$,
i.e., the infinite series that defines $Z$ converges in $L_1$. 
\end{theorem}

\begin{proof}
Let $(\lambda_k)_k$ be an enumeration of the positive eigenvalues of $E[h(x,X_0)\Phi(X_0)]=\lambda\Phi(x)$, sorted in decreasing order, and let $(\Phi_k)_k$ be the corresponding eigenfunctions. 
Following \citet{leucht-jma13}, we set $\lambda_k := 0$ and $\Phi_k \equiv 0$ for all $k > L$ when the number $L$ of non-zero eigenvalues is finite. 
A version of Mercer's theorem (Theorem 2 of \citet{sun-jc05}) gives 
\begin{equation*}
    \label{eq:series}
    h^{(K)}(x,y) = \sum_{k=1}^K\lambda_k\Phi_k(x)\Phi_k(y) \underset{K\rightarrow\infty}{\longrightarrow}h(x,y), \quad \forall x,y \in \textrm{supp}(P^{X_0}).
\end{equation*}
\citet{leucht-jma13} provide the prerequisites necessary for the series to converge absolutely and uniformly on compact subsets of $\textrm{supp}(P^{X_0})$, which apply directly in our setting as well. 
We will consider an approximation of $V_n$ by a $V$-statistic whose core has a finite spectral decomposition, given by $V_n^{(K)} = \frac{1}{n}\sum_{s,t=1}^n h^{(K)}(X_s, X_t)$. 
Because $h$ is positive semi-definite by definition, all eigenvalues are non-negative, implying $V_n - V_n^{(K)} \geq 0$. 
This implies
\begin{align*}
    &E\left| V_n - V_n^{(K)}\right| = E\left[ V_n - V_n^{(K)}\right]\\
    &=E\left[ h(X_0, X_0) - h^{(K)}(X_0, X_0) \right] + \sum_{r=1}^{n-1}2(1-r/n)E\left[ h(X_0, X_r) - h^{(K)}(X_0, X_r) \right]
\end{align*}
By majorized convergence the first term converges to zero as $K\rightarrow\infty$.
For the second term, repeated application of Cauchy-Schwarz gives
\begin{align*}
    &\sum_{r=1}^{n-1}2(1-r/n)E\left[ h(X_0, X_r) - h^{(K)}(X_0, X_r) \right]\\
    &\leq 2\sum_{r=1}^\infty\left| \sum_{j \in \Delta_r} E\left[ \sum_{k=K+1}^\infty \lambda_k \Phi_k(X_0)\Phi_k(X_j) \right] \right|\\
    &= 2 \sum_{r=1}^\infty  \left| E\left[\sum_{j \in \Delta_r} \sum_{k=K+1}^\infty \lambda_k \Phi_k(X_0)   (\Phi_k(X_j) - \Phi_k(\widetilde{X}_j)) \right] \right|\\
    &\leq 2\sum_{r=1}^\infty \sqrt{E\left[\sum_{j \in \Delta_r}  \sum_{k=K+1}^\infty \lambda_k\Phi_k^2(X_0) \right]} \sqrt{E\left[  \sum_{j \in \Delta_r}\sum_{k=K+1}^\infty \lambda_k\left( \Phi_k(X_j) - \Phi_k(\widetilde{X}_j) \right)^2 \right]}\\
    &\leq 2\sqrt{\sum_{k=K+1}^\infty \lambda_k}\sum_{r=1}^\infty\sqrt{E\left[ \sum_{j \in \Delta_r}\sum_{k=1}^\infty \lambda_k \left( \Phi_k(X_j) - \Phi_k(\widetilde{X}_j) \right)^2 \right]}\\
    &\leq 2\sqrt{\sum_{k=K+1}^\infty\lambda_k}\sum_{r=1}^\infty\sqrt{\sum_{j \in \Delta_r} E \left[ h(X_j, X_j) - h(X_j, \widetilde{X}_j) - h(\widetilde{X}_j, X_j) + h(\widetilde{X}_j, \widetilde{X}_j) \right]}\\
    &\leq 2\sqrt{\sum_{k=K+1}^\infty\lambda_k}\sum_{r=1}^\infty\sqrt{2\max(\textrm{deg})^r\,\textrm{Lip}(h)}\sqrt{\tau(r)}
\end{align*}
where $\Delta_r$ is the set of nodes whose shortest-path distance from $X_0$ is $r$, $\max(\textrm{deg})$ is the largest degree in the network, and, for $j \in \Delta_r$, $\widetilde{X}_j$ denotes a copy of $X_j$ that is independent of $X_0$ and satisfies $E\|X_j - \widetilde{X}_j\|_1 \leq \tau(r)$. 
Because $\sum_{k=1}^\infty\lambda_k = Eh(X_0, X_0) < \infty$, we have $\sum_{k=K+1}^\infty\lambda_k \rightarrow 0$ as $K\rightarrow\infty$, and therefore $\underset{n}{\sup}E\left| V_n - V_n^{(K)} \right| \underset{K\rightarrow\infty}{\longrightarrow} 0 $. 
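
To make the role of the factor $\max(\textrm{deg})^r$ explicit, the following counting argument (a sketch; the shorthand $d = \max(\textrm{deg})$ is introduced here for illustration) bounds the number of nodes at each distance. A node has at most $d$ neighbors, and each node at distance $r-1$ contributes at most $d-1$ new nodes at distance $r$, so
\begin{align*}
    |\Delta_r| \leq d\,(d-1)^{r-1} \leq d^r, \qquad r \geq 1.
\end{align*}
For example, with $d = 3$ this gives $|\Delta_1| \leq 3$, $|\Delta_2| \leq 6$, and $|\Delta_3| \leq 12$. In the time-series setting each lag $r$ corresponds to a single instance, whereas here the final bound is finite when $\tau(r)$ decays quickly enough that $\sum_{r} \sqrt{d^r\,\tau(r)} < \infty$.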

The proof of the central limit theorem for partial sums, i.e., for $K \leq L$
\begin{align}
    V_n^{(K)} = \sum_{k=1}^K\lambda_k\left(n^{-1/2}\sum_{t=1}^n \Phi_k(X_t) \right)^2 \overset{d}{\longrightarrow} \sum_{k=1}^K \lambda_kZ^2_k
\end{align}
follows from a direct application of part (\textit{ii}) of the proof of Theorem 2.1 of \citet{leucht-jma13}. 
Combining these two results satisfies the requirements of Theorem 2 of \citet{dehling-spa09}, from which we obtain $V_n \overset{d}{\longrightarrow} Z := \sum_k\lambda_kZ^2_k$. 
The only item remaining to be shown is $EZ < \infty$, which follows from a direct application of part (\textit{iv}) of the proof of Theorem 2 provided by \citet{leucht-jma13}. 
\end{proof}

We now turn our attention to the Hilbert-Schmidt independence criterion.
Note that both of the following results follow almost immediately from Theorem~\ref{thm:weakv}.

\setcounter{theorem}{0}

\begin{theorem}
\label{thm:consprop}
Under the aforementioned assumptions the Hilbert-Schmidt independence criterion of two weakly dependent propositional variables converges in $L_1$ to its population counterpart, i.e., $\left|\overline{\text{HSIC}_n} -\text{HSIC}_{\text{population}}\right| \underset{d}{\longrightarrow}0$.
\end{theorem}
\begin{proof}
Recall that the Hilbert-Schmidt independence criterion~(HSIC) is a test of dependence, i.e. a hypothesis test of paired samples where the null hypothesis is that the two samples are generated independently, $\mathbb{P}_{x,y} = \mathbb{P}_x\mathbb{P}_y$. 
Our focus is on the empirical estimator of HSIC, which can be written as a degree-four $V$-statistic with a core defined by:
\begin{align}
\label{eq:vsic}
    h(z_1, z_2, z_3, z_4) =
    \frac{1}{4!}\sum_{\pi\in S_4} k(x_{\pi(1)}, x_{\pi(2)})\Big[k(y_{\pi(1)}, y_{\pi(2)}) + 
    k(y_{\pi(3)}, y_{\pi(4)})
    - 2k(y_{\pi(2)}, y_{\pi(3)})\Big]
\end{align}
where $z_i = (x_i, y_i)$ and $S_4$ is the set of permutations of $\{1, 2, 3, 4\}$.
Convergence then follows as a direct application of Theorem~\ref{thm:weakv} and the weak law of large numbers.
Note that under independence $Z$ is a zero mean, jointly Gaussian variable and the resulting sequence $\sum_i \lambda_i Z^2_i$ is mean zero.
\end{proof}
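
For reference, the $V$-statistic with the core in Equation~\ref{eq:vsic} can be written out explicitly. The sketch below assumes the standard biased HSIC estimator; the Gram matrices $K_{ij} = k(x_i, x_j)$ and $L_{ij} = k(y_i, y_j)$ and the centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ are notation introduced here for illustration:
\begin{align*}
    \text{HSIC}_n &= \frac{1}{n^2}\sum_{i,j} K_{ij}L_{ij} + \frac{1}{n^4}\sum_{i,j,q,r} K_{ij}L_{qr} - \frac{2}{n^3}\sum_{i,j,q} K_{ij}L_{iq}\\
    &= \frac{1}{n^2}\,\mathrm{tr}(KHLH).
\end{align*}
The three sums correspond, in order, to the three terms of the core.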


\setcounter{corrollary}{0}
\begin{corrollary}
Under the aforementioned assumptions the Hilbert-Schmidt independence criterion between a weakly dependent relational variable and a weakly dependent propositional variable converges in $L_1$ to its population counterpart, i.e., $\left|\text{HSIC}_n -\text{HSIC}_{\text{population}}\right| \underset{d}{\longrightarrow}0$.
\end{corrollary}
\begin{proof}
The central items to be shown in order to apply the results of Theorem~\ref{thm:weakv} are (1) relational kernels define a valid $V$-statistic, and (2) the relational variable remains weakly dependent. 
Item (1) follows directly by taking one of the variables in Equation~\ref{eq:vsic} to be the set of instances returned by the path predicate and $k$ to be the relational kernel defined in the main text. 
Item (2) follows as a consequence of Assumption 5, which bounds the degree of each node by a finite constant $c$. As a result, any path predicate that defines a finite-length path returns a set of size at most $c'$, where $c < c' < \infty$. Hence, so long as the initial random variable is weakly dependent, the relational variable constructed from it is also weakly dependent, albeit with a slower rate of convergence, since the weak dependence coefficient $\tau(r)$ necessarily decays more slowly. 
\end{proof}









\section{Extension to Multi-relational Systems}

In our problem definition we assumed a single-entity, single-relationship relational schema for ease of exposition. Here, we discuss the extensions necessary for a multi-entity, multi-relational system. We consider the set of item classes $\bm{\mathcal{I}}$ to be the union of the entity and relationship classes, $\bm{\mathcal{I}} = \bm{\mathcal{E}} \cup \bm{\mathcal{R}}$, following prior work~\citep{lee-uai17,maier-uai13}. We refer to the attribute classes of an item class $I \in \bm{\mathcal{I}}$ as $\bm{\mathcal{A}}(I)$. Moreover, let $G(I)$ denote the set of items of an item class $I \in \bm{\mathcal{I}}$.

Here, we point out two major differences in a multi-relational system:

\begin{enumerate}
    \item The relational dependence is specifically defined between two item classes $I, J \in \bm{\mathcal{I}}$.
    \item The path predicate $\rho$ is likely to be defined with relational queries rather than random walks over a neighborhood.
\end{enumerate}

We now revisit Definition 1 from the main text with the new notation:

\begin{dfn}[Relational Variable]
Given a relational schema $\mathcal{S} = \langle \bm{\mathcal{E}}, \bm{\mathcal{R}}, \bm{\mathcal{A}} \rangle$, its instantiation $G$, two item classes $I,J \in \bm{\mathcal{I}}$ and a path predicate $\rho$, a relational variable $\sigma(v_i, \bm{X}, G, \rho)$ is the set of attributes $v_j.\bm{X}$ selected by $\rho$ of items $v_j \in G(J)$ reachable from items $v_i \in G(I)$ such that $\bm{X} \subset \bm{\mathcal{A}}(J)$, where the path predicate $\rho$ is a function given by:

\[
    \rho(v_i, G) : G(I) \mapsto \mathcal{P}(G(J))
\]

\end{dfn}

The necessary assumptions and relational dependence definitions still hold. The major difference arises in the compact representation of the relational kernel. Equation 1 remains valid with the updated notion of path predicate. However, the compact representation in Equation 2 is no longer trivial, since the adjacency matrix $A$ is no longer directly applicable. There are two potential workarounds. First, since the compact representation is not mandatory for our method to work, we can still use Equation 1 for multi-relational systems. Second, we can consider the bipartite graph between the items of item classes $I, J \in \bm{\mathcal{I}}$ and use the adjacency matrix $A_{IJ}$ of this bipartite graph instead of $A$. Similarly, a corresponding degree matrix $D_{IJ}$ can be constructed from $A_{IJ}$.
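
As a small illustration of the second workaround (a sketch; the example items and the row-sum construction of $D_{IJ}$ are assumptions made here, not taken from the main text), suppose $G(I) = \{a_1, a_2\}$ and $G(J) = \{b_1, b_2, b_3\}$, where $a_1$ is related to $b_1$ and $b_2$, and $a_2$ is related to $b_3$. Then
\begin{align*}
    A_{IJ} = \begin{pmatrix} 1 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix}, \qquad
    D_{IJ} = \operatorname{diag}(A_{IJ}\mathbf{1}) = \begin{pmatrix} 2 & 0\\ 0 & 1 \end{pmatrix}, \qquad
    D_{IJ}^{-1}A_{IJ} = \begin{pmatrix} \tfrac{1}{2} & \tfrac{1}{2} & 0\\ 0 & 0 & 1 \end{pmatrix}.
\end{align*}
Each row of $D_{IJ}^{-1}A_{IJ}$ averages an attribute over the items of $J$ related to one item of $I$. Note that $A_{IJ}$ is rectangular, of size $|G(I)| \times |G(J)|$, unlike the square matrix $A$ of the single-entity case.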


\section{Experiments}
\subsection{Synthetic Attribute Generation} \label{subsec:att_gen}
Here, we describe the synthetic attribute generation procedure for the three cases mentioned in the main text. Note that only the generation of $v_i.Y$ differs between the null and alternate hypotheses; the other attributes are generated identically. We consider a polynomial dependency model for most of our experiments. $v_i.X$ for case 1 and $v_i.Z$ for cases 2 and 3 are drawn from a uniform distribution $U(0, 1)$, while $v_i.X$ is always \textit{binarized} to resemble the effect of treatment assignment. The outcome $v_i.Y$ is generated according to the following equation for marginal dependence (case 1):
\begin{equation}
    v_i.Y \thicksim
            \begin{cases}
              U(0, 1) & \text{null}\\
              \beta_d \cdot (g(\rel{X}))^2  + \epsilon & \text{alternate}
            \end{cases}
\end{equation}

Conditional dependence (case 2) is reflected by the following equation:
\begin{equation}
\label{eqn:case1}
    \begin{split}
        & v_i.X \thicksim \beta_c \cdot (v_i.Z)^2  + \epsilon\\
        & v_i.Y \thicksim
            \begin{cases}
              \beta_c \cdot (v_i.Z)^2  + \epsilon & \text{null}\\
              \beta_d \cdot (g(\rel{X}))^2 + \beta_c \cdot (v_i.Z)^2 + \epsilon & \text{alternate}
            \end{cases}    
    \end{split}
\end{equation}

Here, $\beta_d$ and $\beta_c$ are the dependence and confounding coefficients, respectively. $\beta_c$ is set to 1.0 in our experiments. $\epsilon$ is noise drawn from the standard normal distribution $N(0, 1)$. $g$ refers to the \textit{mean} aggregate function. We obtain the generating function for case 3 by replacing $g(\rel{X})$ and $v_i.Z$ with $v_i.X$ and $g(\rel{Z})$, respectively, in Equation~\ref{eqn:case1}. Next, we consider the following procedure to simulate the linear threshold model for the diffusion experiment, which falls under case 1:
\begin{equation}
    \begin{split}
        T_i & \thicksim U(0, 1)\\
        v_i.x_{t+1} & = \mathbbm{1}(\mathrm{mean}(\rel{x_t}) > T_i)\\
        v_i.y_{t+1} & = \mathbbm{1}(g(\rel{x_t}) > T_i)
    \end{split}
\end{equation}
where we reassign the $v_i.x$ values at each diffusion step based on the values from the previous step. The $v_i.y$ values are assigned based on the $v_i.x$ values in the last diffusion step.
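
As a worked example of a single update (the neighbor values and thresholds are hypothetical and chosen only for illustration), suppose node $v_i$ has three neighbors with current values $\rel{x_t} = \{1, 0, 1\}$, so that $g(\rel{x_t}) = 2/3$. Then
\begin{align*}
    T_i = 0.5:&\quad v_i.x_{t+1} = \mathbbm{1}(2/3 > 0.5) = 1,\\
    T_i = 0.8:&\quad v_i.x_{t+1} = \mathbbm{1}(2/3 > 0.8) = 0,
\end{align*}
that is, a node becomes active only when the fraction of its active neighbors exceeds its own threshold.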






    
    
\setcounter{figure}{4}

\begin{figure*}[h]
    \centering
    \subfloat{\includegraphics[width=.76\textwidth]{fig/fig_1_2_legend.eps}}\\
    
    \setcounter{subfigure}{0}
    
    \subfloat[BA: Case 1 ]{\label{sfig:vnv_a}\includegraphics[width=0.28\textwidth]{fig/exp_1b_dep_ba_mean_poly_type_ii_var2.eps}}
    \hspace{1em}
    \subfloat[BA: Case 2 ]{\label{sfig:vnv_c}\includegraphics[width=0.28\textwidth]{fig/exp_2b_dep_ba_mean_type_ii_var2.eps}}
    \hspace{1em}
    \subfloat[BA: Case 3 ]{\label{sfig:vnv_e}\includegraphics[width=0.28\textwidth]{fig/exp_2d_dep_ba_mean_type_ii_var2.eps}}\\
    
    \subfloat[ER: Case 1 ]{\label{sfig:vnv_b}\includegraphics[width=0.28\textwidth]{fig/exp_1b_dep_er_mean_poly_type_ii_var2.eps}}
    \hspace{1em}
    \subfloat[ER: Case 2 ]{\label{sfig:vnv_d}\includegraphics[width=0.28\textwidth]{fig/exp_2b_dep_er_mean_type_ii_var2.eps}}
    \hspace{1em}
    \subfloat[ER: Case 3 ]{\label{sfig:vnv_f}\includegraphics[width=0.28\textwidth]{fig/exp_2d_dep_er_mean_type_ii_var2.eps}}\\
    
    
    
    \caption{Impact of relational dependence on Type I/II errors when the noise variance is drawn from $\mathcal{N}(1, 0.2)$ across trials.}%
    
    \label{fig:dep_vnv}
\end{figure*}


\subsection{Impact of varied noise variance}
\label{ssec:noise_vary}

We conducted an experiment in which the noise variance is drawn from a normal distribution, $\sigma^2 \thicksim N(1, 0.2)$, in each trial. Figure~\ref{fig:dep_vnv} shows a slight change in Type II errors compared to Figure 1 in the main paper, but the overall trend is very similar.



\subsection{Impact of activation probability on diffusion}

\begin{figure}[!ht]
    \centering
    \label{sfig:ltm_tw}\includegraphics[width=.30\textwidth]{fig/exp_1b_ltm_tw_steps_type_ii.eps}
    
    \caption{Type II error for the Linear Threshold Model on Twitter ego-network.}
    \label{fig:twitter}
\end{figure}

To showcase the applicability of the proposed method to large-scale real-world relational data, we show an extended version of the diffusion experiment from the main paper. We consider a similar semi-synthetic setup with the Twitter ego-network, a larger real-world network consisting of 11,176 nodes and 144,653 edges~\citep{leskovec-nips12}. We consider a sample of 10,000 nodes and vary the initial activation probabilities. Figure~\ref{fig:twitter} shows the Type II errors (y-axis) for different numbers of diffusion steps (x-axis). The lines correspond to the initial activation probabilities (AP) of the diffusion process. We see a general trend of decreasing Type II error with more diffusion steps, which nearly saturates by step 10. Moreover, the result indicates that the test is sensitive to the activation probability: higher activation probabilities lead to higher Type II error.


\subsection{Comparison to Sobolev Independence Criterion (SIC)}


\begin{figure}[h]
    \centering
    \subfloat{\includegraphics[width=.86\textwidth]{fig/fig_1_2_legend_sic.eps}}\\
    
    \setcounter{subfigure}{0}
    \subfloat[Case 1: \text{\ba} ]{\label{sfig:dep_s_a_sic}\includegraphics[width=0.30\textwidth]{fig/exp_1b_dep_ba_mean_poly_type_ii_sic.eps}}\hfill
    \subfloat[Case 2: \text{\ba} ]{\label{sfig:dep_s_c_sic}\includegraphics[width=0.30\textwidth]{fig/exp_2b_dep_ba_mean_type_ii_sic.eps}}\hfill
    \subfloat[Case 3: \text{\ba} ]{\label{sfig:dep_s_e_sic}\includegraphics[width=0.30\textwidth]{fig/exp_2d_dep_ba_mean_type_ii_sic.eps}}\\
    
    \subfloat[Case 1: \text{\er} ]{\label{sfig:dep_s_b_sic}\includegraphics[width=0.30\textwidth]{fig/exp_1b_dep_er_mean_poly_type_ii_sic.eps}}\hfill
    \subfloat[Case 2: \text{\er} ]{\label{sfig:dep_s_d_sic}\includegraphics[width=0.30\textwidth]{fig/exp_2b_dep_er_mean_type_ii_sic.eps}}\hfill
    \subfloat[Case 3: \text{\er} ]{\label{sfig:dep_s_f_sic}\includegraphics[width=0.30\textwidth]{fig/exp_2d_dep_er_mean_type_ii_sic.eps}}
    
    \caption{Type I/II errors with polynomial dependency model on synthetic networks for all three cases.}
    
    \label{fig:dep_s_sic}
\end{figure}

To show the effectiveness of relational CI methods compared to CI methods developed for i.i.d. data, we compare both RCI methods (NIRD, KRCIT) to a recent i.i.d. CI test, the Sobolev Independence Criterion (SIC)~\citep{mroueh-nips19}. SIC is an interpretable dependency measure between multivariate random variables, characterized by an integral probability metric between the joint distribution and the product of the marginals. We perform the SIC test on the flattened representation of the relational data, similar to KRCIT. Figure~\ref{fig:dep_s_sic} extends the results shown in Figure 1 in the main text. In all three cases, the i.i.d. baseline SIC exhibits high Type I error, which shows that it is poorly calibrated for reasoning over relational data.






\section{Real-world demonstration}
One of the main challenges in social studies is to identify the effect of friends on their peers and the strength of such effects in different domains, e.g. health and violence. Studies show that patterns of interactions among adolescents can reveal possible reasons for changes in their behavior over time. The central question in such studies is how to identify and measure the existence of such effects. The proposed independence test can facilitate reasoning over the existence of dependence between peers' behaviors in social networks by providing a mechanism for falsifying statistical hypotheses.

As a demonstration, we examine the 50 Women dataset~\citep{michell-ssm97}. This dataset records the smoking, sport, drug, and alcohol consumption habits of 50 female students, along with their friendship information, over the course of three years. Each of the behavioral variables is coded as a categorical variable indicating how regularly the students engage in the behavior.
Taking independence between the behaviors of peers as the null hypothesis, the goal of this analysis is to explore whether the habits of a student's friends are associated with her habits in subsequent years. 

\begin{table}
\caption{Real-world demonstration: exploration of the dependence between the habits of students and their first-hop neighbors in the 50 Women dataset.}
\centering
\small\addtolength{\tabcolsep}{-2pt}
\begin{tabular}{lrrrrr}  
\toprule
Period & Attribute & Attribute type & $t0$ & \mcit\_all & \mcit\_t0\\
\midrule
    1 $\rightarrow$ 2 &   alcohol &      binary &   4 &  0.425532 &  0.000000 \\
    1 $\rightarrow$ 2 &   alcohol &  categorical &   NA &  0.000000 & NA \\
    1 $\rightarrow$ 2 &      drug &      binary &  35 &  0.000000 &  0.138298 \\
    1 $\rightarrow$ 2 &      drug &  categorical &   NA &  0.000000 & NA \\
    1 $\rightarrow$ 2 &     smoke &      binary &  35 &  0.000000 &  0.021277 \\
    1 $\rightarrow$ 2 &     smoke &  categorical &   NA &  0.000000 & NA \\
    1 $\rightarrow$ 2 &     sport &      binary &  12 &  0.925532 &  0.978723 \\
    1 $\rightarrow$ 2 &     sport &  categorical &   NA &  0.925532 & NA \\
    1 $\rightarrow$ 2,3 &   alcohol &      binary &   5 &  0.114583 &  0.197917 \\
    1 $\rightarrow$ 2,3 &   alcohol &  categorical &   NA &  0.000000 & NA \\
    1 $\rightarrow$ 2,3 &      drug &      binary &  35 &  0.000000 &  0.583333 \\
    1 $\rightarrow$ 2,3 &      drug &  categorical &   NA &  0.000000 & NA \\
    1 $\rightarrow$ 2,3 &     smoke &      binary &  36 &  0.000000 &  0.000000 \\
    1 $\rightarrow$ 2,3 &     smoke &  categorical &   NA &  0.000000 & NA \\
    1 $\rightarrow$ 2,3 &     sport &      binary &  12 &  1.000000 &  0.166667 \\
    1 $\rightarrow$ 2,3 &     sport &  categorical &   NA &  1.000000 & NA \\
    2 $\rightarrow$ 3 &   alcohol &      binary &   3 &  0.125000 &  0.666667 \\
    2 $\rightarrow$ 3 &   alcohol &  categorical &   NA &  0.281250 & NA \\
    2 $\rightarrow$ 3 &      drug &      binary &  32 &  0.000000 &  0.125000 \\
    2 $\rightarrow$ 3 &      drug &  categorical &   NA &  0.000000 & NA \\
    2 $\rightarrow$ 3 &     smoke &      binary &  31 &  0.000000 &  0.281250 \\
    2 $\rightarrow$ 3 &     smoke &  categorical &   NA &  0.000000 & NA \\
    2 $\rightarrow$ 3 &     sport &      binary &  20 &  0.864583 &  0.479167 \\
    2 $\rightarrow$ 3 &     sport &  categorical &   NA &  0.864583 & NA \\
\bottomrule

\end{tabular}

\label{OurMethodVSBaseline}
\end{table}
Table~\ref{OurMethodVSBaseline} shows the p-values estimated by our kernel test for four attributes in the 50 Women dataset. The column \textit{Period} indicates the years considered for the test; e.g., in period $1 \rightarrow 2,3$, we explore the change in students' behavior from the first year to the second and third years. 
We consider both the original categorical coding and a binarization of the categorical attributes, which is 1 if the student uses a substance at least once during the year and 0 otherwise.
The number of students who did not engage in the behavior during the first time point is shown in column $t0$; e.g., in the first row of the table, 4 students did not drink alcohol in the first year. We exclude $t0$ for categorical data (indicated by NA) because the frequency of the habit is intrinsic to the hypothesis of interest in these cases. The last two columns (\mcit\_all and \mcit\_t0) show the p-values measured by \mcit. For \mcit\_all we consider all women (whether or not they have the habit in a given year), and for \mcit\_t0 we consider only the women who do not have the habit at the first time point. 
Overall we find:
\begin{itemize}[leftmargin=*]
    \item Sports activity of peers is not associated with whether a student plays a sport. High values of \mcit\_t0 and \mcit\_all provide no evidence against the null hypothesis of independence.
    \item Peer smoking habits are associated with students' frequency of smoking: \mcit\_all $= 0$ and \mcit\_t0 $< 0.022$ for all time periods, except period $2 \rightarrow 3$, where \mcit\_t0 $\approx 0.28$.
    \item Peer drug use is not associated with subsequent drug use in previously non-drug-using students (\mcit\_t0 $> 0.05$). However, when we consider the effect of drug users on non-drug users and vice versa, peer drug use is associated with both the use and the rate of consumption (\mcit\_all $= 0$). 
    \item Peer alcohol consumption is associated with the level of subsequent alcohol use (\mcit\_all $= 0$, except in period $2 \rightarrow 3$, where \mcit\_all $> 0.1$),
    but not with the decision of a non-drinking student to begin drinking.
\end{itemize}

Several studies~\citep{michell-ssm97,pearson-depp00} use the 50 Women data to explore the association between gender, risk-taking, or social position and smoking or drug usage in groups of youngsters. In particular, our results comport with \citet{pearson-depp00}, who show that drug usage and smoking are contagious among groups of friends who are highly connected and among people who are loosely connected to a friendship group.



\bibliography{ahsan_640}


\end{document}