% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} 
%% In your camera-ready you should use the 'accepted' parameter. This shows the authors and how an accepted paper will look like. The footer is 'Acccepted for X'. In the final version, the proceedings chairs will add the page numbers for PMLR and the final footer will be 'Proceedings of X'.
%
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{algorithm}
\usepackage{algorithmic}
\usepackage{multirow}
\usepackage{caption}
\usepackage{lipsum}
\usepackage{subcaption}
\let\Bbbk\relax
\usepackage{amssymb}
\usepackage{float}
\usepackage{theoremref}
\usepackage{graphicx}
\usepackage{color}
\usepackage{amsthm,amsmath,amssymb}
\usepackage{mathrsfs}

\usepackage[T1]{fontenc}

\newtheorem{theorem}{Theorem}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Mutual Information Based Bayesian Graph Neural Network \\ for Few-shot Learning}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Important:  case of equal contributions, we strongly recommend to NOT show it in this part of the paper, but rather describe it in the appropriate section at the end of the paper "Author Contribution", where you have more space to describe how each author contributed.
%
% Add authors
% Remember to use the order convention "First/Given name" "Last/Family name", e.g. John Smith, Hanako Yamada, Marco Rossi, Wei Zhang
\author[1,2]{Kaiyu Song}
\author[1,2]{\href{mailto:<kyue@ynu.edu.cn>?Subject=Your UAI 2022 paper}{Kun Yue}{}}
\author[1,2]{Liang Duan}
\author[1,2]{Mingze Yang}
\author[2,3]{Angsheng Li}
% Add affiliations after the authors
\affil[1]{%
    Key Lab of Intelligent Systems and Computing of Yunnan Province\\
    Yunnan University\\
    Kunming, China
}
\affil[2]{%
    School of Information Science and Engineering\\
    Yunnan University\\
    Kunming, China
}
\affil[3]{%
    School of Computer Science and Engineering\\
    Beihang University\\
    Beijing, China
  }
  
  \begin{document}
\maketitle

\begin{abstract}
  In few-shot learning based on deep neural networks, the limited training data may 
  make the neural network extract ineffective features, which leads to inaccurate results. 
  In a Bayesian graph neural network (BGNN), the probability distributions on hidden layers 
  imply useful features, and few-shot learning could be improved by establishing the 
  correlation among features. Thus, in this paper, we incorporate mutual information (MI) 
  into BGNN to describe the correlation, and propose an innovative framework by adopting 
  the Bayesian network with continuous variables (BNCV) for effective calculation of MI. 
  First, we build the BNCV simultaneously when calculating the probability distributions 
  of features from the Dropout in hidden layers of BGNN. Then, we approximate the MI values 
  efficiently by probabilistic inferences over BNCV. Finally, we give the correlation based 
  loss function and training algorithm of our BGNN model. Experimental results show that our 
  MI based BGNN framework is effective for few-shot learning and outperforms some state-of-the-art 
  competitors by large margins on accuracy.
\end{abstract}

\section{Introduction}\label{sec:intro}
Few-shot learning aims to learn novel concepts from only one or a few annotated samples,
which is an interesting problem and has received a lot of attention recently \citep{Ma}.
Different from traditional machine learning models built on a large amount of training data,
few-shot learning is defined for scenarios with limited supervised experience \citep{Zheng}.
It is challenging to fulfill efficient few-shot learning, since extracting effective and 
representative features often requires large-scale training datasets \citep{Gairola}.



To find more useful features for few-shot learning, several methods have been proposed
and one popular solution is Bayesian graph neural network (BGNN) \citep{Handson}, which is a
graph neural network (GNN) to describe the uncertain relationships among features in datasets.
By using Bayesian approximation over uncertainty, BGNN could extract more effective features
to improve the performance of few-shot learning tasks \citep{Hasanzadeh}. 
However, it is still difficult to achieve a highly accurate few-shot learning method based 
on BGNN, since the limited training data is insufficient for neural networks to capture the 
most useful features and makes over-smoothing and over-fitting much more severe.



One advantage of BGNN is that the probability distribution of the features extracted from
hidden layers could be used to build a larger feature space. If we could eliminate redundant
features and restrain the feature space in terms of the convergence with limited samples,
more useful features could be extracted. Note that the correlation of probability distributions 
among hidden layers implies useful information of features. Thus, by scaling the correlation of 
feature spaces, we could improve the effectiveness of feature extraction 
of BGNN for few-shot learning. For example, given two neighbor hidden layers in a BGNN for face 
recognition, the prior layer extracts \emph{location} and the next layer extracts \emph{shape}. 
Then, based on the correlation between these two kinds of features, we could make the next layer 
extract both \emph{shape} and \emph{location} instead of just \emph{shape}.



Probability distributions reflect both the features in hidden layers and corresponding propagation 
operations like graph convolution. If the correlation among features could be extracted and 
described, it could be used to make hidden layers of BGNN share more information and find more 
useful features by maximizing the correlation \citep{KipfandWelling}. For this purpose, we adopt 
mutual information (MI) to describe the correlation quantitatively, and formulate the process of 
MI maximization to make BGNN share as much information as possible by following the forward flow 
in BGNN training \citep{Gabrie}.


However, calculating MI is not trivial even if the probability distribution functions (PDFs) 
have been given \citep{Rana}. The Bayesian network (BN) is a well-known framework for 
uncertain knowledge representation and inference via a directed acyclic graph (DAG) of random 
variables with conditional probability parameters \citep{Koller}. Thus, we use BN with continuous 
variables (BNCV) \citep{LiandMahadevan} to effectively approximate MI.


In this paper, we propose an innovative framework to establish the correlation among features 
in hidden layers of BGNN to improve the accuracy of few-shot learning tasks. Specifically, 
we first build our framework based on a Bayesian graph convolution neural network with adaptive 
connection sampling (BGS) \citep{Hasanzadeh} (an efficient version of BGNN). We then approximate 
the probability distributions of features extracted from hidden layers by the relaxed Bernoulli 
distribution in BGS. Thus, the continuity of the probability distributions in BNCV could be guaranteed. 
We then define the nodes of BNCV as the marginal distributions extracted from features, and 
establish the correlation between two neighbor layers and connect the end of the prior pair with 
the start of the next pair based on the forward flow when training BGS. Thus, the DAG of BNCV could 
be constituted. Although the distributions of the nodes in a BNCV cannot be directly obtained, 
the Bayesian approximation of the relaxed Bernoulli distributions based on Dropout has already 
provided the conditional probability function (CPF) for BNCV. Consequently, the approximation
of MI could be fulfilled by using CPF and Monte Carlo integration based on the probabilistic 
inferences over BNCV. Finally, we provide the correlation based loss function and training 
algorithm of our BGNN model.

Our main contributions are summarized as follows:
\begin{itemize}
	\item We propose an innovative framework to extract effective features for few-shot learning 
	by establishing the correlation among features from the probability distributions in BGS.
	
	\item We build BNCV efficiently from the Dropout in hidden layers of BGS and approximate the 
	MI values effectively based on the probabilistic inferences over BNCV to describe the correlation 
	quantitatively.
	
	\item We provide the loss function by incorporating the MI-based correlation and propose the 
	training algorithm of our BGNN model.
	
	\item We conduct extensive experiments on Cora and Citeseer datasets, and the results show that 
	our proposed framework is effective for few-shot learning and outperforms some state-of-the-art 
	competitors by large margins on accuracy.
	
\end{itemize}




\section{Related work}
Few-shot learning aims to learn novel concepts from only one or a few examples, which is 
an interesting and challenging problem in practical applications \citep{GaoFL0X21}. 
Recently, many meta-learning \citep{metagan} and transfer learning \citep{WangYKN20} methods 
have been proposed to solve this problem. Most of these methods are combined with 
a variety of deep learning models, where the correlation among features is usefully provided. 
For example, the correlation could be obtained by using GNN to propagate structural information 
\citep{GarciaandBruna}, and a similarity metric is established to achieve the correlation between 
two similar samples for few-shot image segmentation \citep{Gairola}. 


However, describing correlation as the hidden feature for few-shot learning is still challenging. 
Thus, several methods have been proposed to establish the concept of correlation. 
\citet{att} give the concept of correlation between global and local features based on the dual 
attention network. \citet{NS} conceptualize the correlation between existing and new relations 
via embedding. \citet{GFL} provide a method to learn the correlation from auxiliary graphs via 
knowledge transfer. To evaluate the correlation explicitly, MI-based methods have been adopted. 
\citet{Di} use MI to build a 2-depth adjacency matrix to leverage the correlation, and \citet{MI} 
use MI to enrich the representation of knowledge extracted from correlation. Moreover, Graphical 
Mutual Information (GMI) has been proposed to measure the correlation between input graphs and 
high-level hidden representations \citep{Peng2} based on graph embedding. By these methods, 
the correlation could be evaluated, but still cannot be calculated quantitatively even when MI is adopted.


Note that MI cannot be easily calculated and is usually approximated in practice \citep{Gabrie}.
Thus, a generative adversarial neural network based method has been proposed to approximate
the CPF of density and marginal distributions ultimately \citep{Abbasnejad}. However, these
methods are often inefficient \citep{Gabrie}. To solve this issue, the neural network based
conditional MI was proposed as the approximation of the MI \citep{CCMI}, but it does not hold
in BGNN for few-shot learning due to the limited training data.


Integrating deep learning with Bayesian models has attracted much attention as a way to
interpret neural networks or to infer conditional (or even causal)
relations and the corresponding uncertainty. For example, \citet{cdnn} propose
a method for unsupervised structure learning of deep neural networks by casting the problem
of neural network structure learning as a problem of BN structure learning. \citet{sin} 
propose a unified algorithm to efficiently learn a compiled inference network and the generative 
model simultaneously for non-linear state space models to mimic the posterior distribution. 
\citet{sobdl} survey the models and applications of Bayesian deep learning to tightly integrate 
deep learning and Bayesian models for establishing a comprehensive artificial intelligence system 
with the capabilities of perception and probabilistic inferences. Different from these methods, 
we adopt BN for evaluating the correlation efficiently, and build BNCV based on the probability 
distributions of hidden layers in BGS.


\begin{figure*}[ht]
	\centering
	\includegraphics[scale=0.6]{framework.pdf}
	\caption{The framework of BGSMI.
		The network structure of BGS is shown at the bottom.
		Built from BGS, the BNCV with DAG and CPFs are shown at the top.
		Blue arrows among Sampling, CPF, MI approximation and DAG show the approximation of 
		$P(L_{i}|L_{i-1})$ by probabilistic inferences over BNCV.
		Black arrows show the direction of forward flow between neighbor hidden layers in BGS.
	}\label{fig:framework}
\end{figure*}





\section{Problem formulation}
%To implement an effective neural network for few-shot learning, we focus on representing the correlation among features in hidden layers in BGS by using MI.
First, we formulate some concepts as the basis of later discussion.

A dataset with limited training data is represented as $D(G,F,Y)$, where $G$ is the undirected graph, $F$ is the set of original features, and $Y$ is the set of labels.

Taking as input the training dataset $D$, a BGS contains $I$, $O$, $L$, and $B$, where $I$ (i.e., $I=D$ for the convenience of expression) and $O$ are the input and output, respectively, $L$ is the set of graph convolution operations, and $B$ is the set of relaxed Bernoulli distributions of all hidden layers, where
\begin{itemize}
	\item $La(G)$ is the function that passes structural information for a given undirected graph $G$, such as Laplace decomposition \citep{KipfandWelling}, and $\sigma (\cdot)$ is the activation function.
	
	\item $L=\{L_{1},\dots,L_{n}\}$, $B=\{B_{1},\dots,B_{n}\}$, $L_{0}=I$,\\
	$L_{i}=\sigma(La(G)L_{i-1})B_{i}(1\leq i\leq n)$.
	
	\item $O=Softmax(L_{n})$, and $n$ is the depth (i.e., number of layers) of BGS.
\end{itemize}
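
To make the recursion concrete, the definition above can be unrolled for a hypothetical depth of $n=2$ (the depth is chosen only for illustration):
\begin{equation*}
	\begin{aligned}
		L_{1}&=\sigma(La(G)L_{0})B_{1}=\sigma(La(G)I)B_{1},\\
		L_{2}&=\sigma(La(G)L_{1})B_{2}=\sigma\bigl(La(G)\,\sigma(La(G)I)B_{1}\bigr)B_{2},\\
		O&=Softmax(L_{2}).
	\end{aligned}
\end{equation*}
Each $L_{i}$ therefore depends on the input only through $L_{i-1}$ and the distribution $B_{i}$ of its own layer.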

A BNCV contains $V$ and $E$, where $V$ is the set of nodes (i.e., random variables) and $E$ is the set of directed edges, together with a set of conditional probabilities that quantify the dependencies among the nodes in $V$.
The DAG of BNCV is represented as $G^{d}(V,E)$.
To build a BNCV from BGS, we consider generating $V$ based on $B$ and $E$ from the forward flow between neighbor layers in $L$.

MI can be formulated by entropy \citep{Di} w.r.t. the probability distribution of $(L_{i-1},L_{i})$ as follows
\begin{equation}\label{eq:1}
	MI(P(L_{i-1}),P(L_{i}))=H(P(L_{i}))-H(P(L_{i}|L_{i-1}))
\end{equation}
where $P(\cdot)$ is the PDF of random variables and $H(\cdot)$ is the entropy of $P(\cdot)$.

The correlation between a pair of neighbor layers is defined as $(L_{i-1},L_{i})$ and obtained by the approximation of MI, denoted as $\widetilde{\mathcal{M}}(L_{i-1},L_{i})$.
Correspondingly, the correlation of the whole framework is defined as $\mathcal{M}\{(L_{1},L_{2}),\dots,(L_{n-1},L_{n})\}$, abbreviated as $\mathcal{M}(L)$.

We cast our problem of correlation evaluation as the problem of calculating MI-based correlation, which could be efficiently implemented by the probabilistic inferences over BNCV.
Meanwhile, we include the correlation in the loss function for training our model, which also enables BGS to make use of the correlation among hidden layers.




\section{Methodology}
\subsection{Framework}

First, we propose the innovative framework, BGS based on MI (BGSMI), which consists of a BGS and a BNCV, as shown in Figure~\ref{fig:framework}.
Then, we describe the ideas of BGS, BNCV, MI approximation and training of BNCV, respectively.

\subsubsection{\textbf{BGS}}

To represent the correlation based on limited training data, $(L_{i-1},L_{i})$ could be established directly by following the forward flow between two neighbor layers of BGS.

To establish $\{(L_{1},L_{2}),\dots,(L_{n-1},L_{n})\}$, we limit the propagation scope of correlation, stated in Theorem \ref{th:1}, since only the prior layer influences the result of the current layer according to the formulation of the graph convolution operation $L_{i}=\sigma(La(G)L_{i-1})B_{i}$.
\begin{theorem}\label{th:1}
	The correlation w.r.t. BGS is between every pair of neighbor layers $L_{i-1}$ and $L_{i}$ $(1 < i\leq n)$.
\end{theorem}
By the graph convolution operation, the Dropout in $B$ could be used to approximate the CPF between every pair of neighbor layers, denoted as $P(L_{i}|L_{i-1},La(G)W_{i-1})$, where $W_{i-1}$ is the set of parameters of $L_{i-1}$. $O=Softmax(L_{n})$ is adopted as
$e^{Z_{j}} / {\sum_{k=1}^{|L_{n}|}e^{Z_{k}}}$
to generate the output, where $Z_{j}$ is the $j$th feature generated by $L_{n}$.
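
As a small worked instance of the Softmax above, suppose (purely for illustration) that $L_{n}$ generates two features $Z_{1}=0$ and $Z_{2}=\ln 3$, and let $O_{j}$ denote the $j$th component of $O$ (a notation introduced here). Then
\begin{equation*}
	O_{1}=\frac{e^{0}}{e^{0}+e^{\ln 3}}=\frac{1}{4},\qquad
	O_{2}=\frac{e^{\ln 3}}{e^{0}+e^{\ln 3}}=\frac{3}{4},
\end{equation*}
so the outputs are positive, sum to one, and can be read as class probabilities.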

\subsubsection{\textbf{BNCV}}

In a BNCV, the DAG represents the correlation between two neighbor layers 
in BGS, and the CPFs are the conditional probability functions of the nodes in BNCV. To build a BNCV, we consider 
the following two aspects.

First, the marginal distribution $MP(B_{i})$ w.r.t. $L_{i}$ is equal to $P(B_{i})$, which is extracted from $L_{i}$ and regarded as a node in BNCV.
Thus, we generate $V$ of BNCV w.r.t. $B$ of BGS.
Following the forward flow in BGS, $V_{i}$ w.r.t. $B_{i}$ has the relationship with $V_{i-1}$ w.r.t. $B_{i-1}$, since the correlation reflects the dependence relationship in BNCV by a directed edge.
$E$ can be generated and further $G^{d}(V,E)$ can be built by following the forward flow between neighbor layers.

Second, according to Theorem~\ref{th:1}, the CPF of each node is defined as
\begin{equation}\label{eq:2}
	\begin{aligned}
		P(L_{i}|L_{i-1})=P(L_{i}|L_{i-1},La(G)W_{i-1})
	\end{aligned}
\end{equation}

Note that Equation~\eqref{eq:2} describes the conditional distribution between two nodes in BNCV and $P(L_{i}|L_{i-1})$ is analogous to the conditional probability table (CPT) in BN with discrete variables.
In terms of the structure of the BGSMI, $P(L_{i}|L_{i-1})$ could be calculated as
\begin{equation}\label{eq:3}
	P(L_{i}|L_{i-1})\approx Bernoulli(\pi_{i})\approx B_{i}(a_{i},b_{i})
\end{equation}
where $Bernoulli(\pi_{i})$ is the Bernoulli distribution with parameter $\pi_{i}$, and $B_{i}(a_{i},b_{i})$ is the relaxed Bernoulli distribution w.r.t. $L_{i}$ with parameters $a_{i}$ and $b_{i}$.
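
The sampling behind Equation~\eqref{eq:3} can be illustrated with a common reparameterization of the relaxed Bernoulli (binary Concrete) distribution. This is only a sketch under assumptions: the temperature $\tau>0$ and the uniform noise $u$ are symbols introduced here, and how they relate to the parameters $a_{i}$ and $b_{i}$ is not specified above.
\begin{equation*}
	z=\mathrm{sigmoid}\!\left(\frac{\ln\pi_{i}-\ln(1-\pi_{i})+\ln u-\ln(1-u)}{\tau}\right),\qquad u\sim \mathrm{Uniform}(0,1).
\end{equation*}
Under this parameterization, $z$ takes values in $(0,1)$ and is differentiable w.r.t. $\pi_{i}$, and as $\tau\to 0$ the samples approach those of $Bernoulli(\pi_{i})$.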

Thus, we could calculate $P(L_{i}|L_{i-1})$ for each pair $(L_{i-1},L_{i})$ $(1 < i\leq n)$ over BNCV.

\subsection{BNCV Based MI Approximation}

We use MI to describe the correlation between $L_{i-1}$ and $L_{i}$, denoted as $\widetilde{\mathcal{M}}(L_{i-1},L_{i})$ and approximated as follows
%According to the definition of MI, the MI could be approximated by

\begin{equation}\label{eq:7}
	\begin{aligned}
		\widetilde{\mathcal{M}}(L_{i-1},L_{i})&=H(P(L_{i}))+H(P(L_{i-1}))\\
		&-H(P(L_{i-1},L_{i}))
	\end{aligned}
\end{equation}
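
The symmetric form in Equation~\eqref{eq:7} agrees with Equation~\eqref{eq:1}. A short check, using the chain rule of entropy $H(P(L_{i-1},L_{i}))=H(P(L_{i-1}))+H(P(L_{i}|L_{i-1}))$, is
\begin{equation*}
	\begin{aligned}
		&H(P(L_{i}))+H(P(L_{i-1}))-H(P(L_{i-1},L_{i}))\\
		&\quad=H(P(L_{i}))+H(P(L_{i-1}))-H(P(L_{i-1}))-H(P(L_{i}|L_{i-1}))\\
		&\quad=H(P(L_{i}))-H(P(L_{i}|L_{i-1})).
	\end{aligned}
\end{equation*}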


Note that $P(L_{i-1},L_{i})$ could be calculated by the probabilistic inferences over BNCV, and thus the correlation of the whole framework is denoted as $\mathcal{M}(L)$ and defined as
\begin{equation}\label{eq:4}
	\mathcal{M}(L)=\frac{1}{n}\sum_{i=2}^{n}{\widetilde{\mathcal{M}}(L_{i-1},L_{i})}
\end{equation}
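
For instance, with a hypothetical depth of $n=3$, Equation~\eqref{eq:4} contains two pairwise terms:
\begin{equation*}
	\mathcal{M}(L)=\frac{1}{3}\Bigl(\widetilde{\mathcal{M}}(L_{1},L_{2})+\widetilde{\mathcal{M}}(L_{2},L_{3})\Bigr).
\end{equation*}
If, purely for illustration, $\widetilde{\mathcal{M}}(L_{1},L_{2})=0.6$ and $\widetilde{\mathcal{M}}(L_{2},L_{3})=0.9$, then $\mathcal{M}(L)=\frac{1}{3}(0.6+0.9)=0.5$. Note that the sum has $n-1$ terms while the normalization is by $n$.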


Equation~\eqref{eq:4} could be approximated during one epoch of BGS training by probabilistic inferences over BNCV.
Adopting the lower bound of $H(P(L_{i-1},L_{i}))$, we make the following transformation
\begin{equation}\label{eq:8}
	H(P(L_{i-1},L_{i}))\approx \max\{H(P(L_{i-1})),H(P(L_{i}))\}.
\end{equation}

Thus, $\widetilde{\mathcal{M}}(L_{i-1},L_{i})$ in Equation~\eqref{eq:7} could be formulated as
\begin{equation}\label{eq:9}
	\widetilde{\mathcal{M}}(L_{i-1},L_{i})=H(P(L_{i-1}))+H(P(L_{i}))-\max\{\cdot\},
\end{equation}
where $\max\{\cdot\}$ denotes $\max\{H(P(L_{i-1})),H(P(L_{i}))\}$ for the convenience of expression, and the entropy is calculated by
\begin{equation}\label{eq:10}
	H(P(L_{i}))=-\int{P(L_{i})\log_{2}P(L_{i})}\,dx.
\end{equation}
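
Before turning to the estimation of the entropies, note that Equation~\eqref{eq:9} can be simplified. For any real numbers $a$ and $b$, $a+b-\max\{a,b\}=\min\{a,b\}$, so
\begin{equation*}
	\widetilde{\mathcal{M}}(L_{i-1},L_{i})=\min\{H(P(L_{i-1})),H(P(L_{i}))\}.
\end{equation*}
As a hypothetical example, if $H(P(L_{i-1}))=1.2$ bits and $H(P(L_{i}))=0.8$ bits, then $\widetilde{\mathcal{M}}(L_{i-1},L_{i})=1.2+0.8-1.2=0.8$ bits. Under the approximation in Equation~\eqref{eq:8}, the correlation of a pair is thus bounded by the smaller of the two layer entropies.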

By Monte Carlo integration \citep{Handson}, Equation~\eqref{eq:10} could be approximated as
\begin{equation}\label{eq:11}
	H(P(L_{i}))\approx-t\sum_{Sp(L_{i})}{\log_{2}P(L_{i})},
\end{equation}
where $Sp(\cdot)$ denotes the sample space of variables, $t=\frac{1}{Ns(Sp(L_{i}))}$, and $Ns(\cdot)$ denotes the number of possible variables in $Sp(\cdot)$.
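
A small numerical illustration of Equation~\eqref{eq:11}, with hypothetical values: suppose four samples are drawn and the density $P(L_{i})$ evaluates to $0.5$, $0.25$, $0.25$ and $0.5$ at these samples. Then $t=\frac{1}{4}$ and
\begin{equation*}
	H(P(L_{i}))\approx-\frac{1}{4}\bigl(\log_{2}0.5+\log_{2}0.25+\log_{2}0.25+\log_{2}0.5\bigr)=-\frac{1}{4}(-1-2-2-1)=1.5
\end{equation*}
bits. The estimate is an average of $-\log_{2}P(L_{i})$ over the samples, so its quality depends on the samples being drawn from $P(L_{i})$ itself.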

Importantly, the size of the sample space is equal to the number of samples drawn during the process of sampling \citep{Handson}. Based on the independence relationships represented in BNCV, $P(L_{i})$ could be formulated as
\begin{equation}\label{eq:12}
	P(L_{i})=\int{P(L_{i},L_{i-1},\dots,L_{1})}\,d(Sp(L_{i-1},\dots,L_{1})).
\end{equation}
By Monte Carlo integration, we have
\begin{equation}\label{eq:13}
	P(L_{i})\approx \frac{1}{Ns(sl)}\sum_{sl}{\frac{P(L_{i},L_{i-1},\dots,L_{1})}{P(L_{i-1},\dots,L_{1})}},
\end{equation}
where $sl$ denotes $Sp(L_{i-1},\dots,L_{1})$, and $\sum_{sl}{\frac{P(L_{i},L_{i-1},\dots,L_{1})}{P(L_{i-1},\dots,L_{1})}}$ denotes the fraction for all possible combinations of $P(L_{i},L_{i-1},\dots,L_{1})$ and $P(L_{i-1},\dots,L_{1})$ in $Sp(L_{i-1},\dots,L_{1})$ for the convenience of expression.

By sampling over the relaxed Bernoulli distributions $B$ in BGSMI, we could obtain $P(L_{i}|L_{i-1})$ efficiently by Equation~\eqref{eq:3}.
It is worth noting that the parameters $a_{i}$ and $b_{i}$ of $B_{i}$ have already been calculated during the training of BGS.
Then, using the chain rule, the joint probability distribution of $\{L_{i},L_{i-1},\dots,L_{1}\}$ could be obtained as
\begin{equation}\label{eq:15}
	\begin{aligned}
		P(L_{i},L_{i-1},\dots,L_{1})&\approx \prod_{j=2}^{i}P(L_{j}|L_{j-1})\\
		&=P(L_{i}|L_{i-1})\prod_{j=2}^{i-1}P(L_{j}|L_{j-1})\\
		&\approx P(L_{i}|L_{i-1})P(L_{i-1},\dots,L_{1}).
	\end{aligned}
\end{equation}

By Equation~\eqref{eq:13} and Equation~\eqref{eq:15}, we have
\begin{equation}\label{eq:16}
	H(P(L_{i}))\approx \frac{-1}{Ns(Sp(L_{i}))}\sum_{Sp(L_{i})}{\log_{2}P(L_{i}|L_{i-1})}.
\end{equation}
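
One way to read the step from Equation~\eqref{eq:13} and Equation~\eqref{eq:15} to Equation~\eqref{eq:16} is sketched below. The single-sample simplification in the second line is an assumption made here for illustration and is not stated explicitly above.
\begin{equation*}
	\begin{aligned}
		P(L_{i})&\approx \frac{1}{Ns(sl)}\sum_{sl}{\frac{P(L_{i}|L_{i-1})P(L_{i-1},\dots,L_{1})}{P(L_{i-1},\dots,L_{1})}}
		=\frac{1}{Ns(sl)}\sum_{sl}{P(L_{i}|L_{i-1})},\\
		P(L_{i})&\approx P(L_{i}|L_{i-1})\quad \text{if each sample of $L_{i}$ is paired with one sampled value of $L_{i-1}$.}
	\end{aligned}
\end{equation*}
Substituting the second line into Equation~\eqref{eq:11} then yields Equation~\eqref{eq:16}.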

Finally, by Equation~\eqref{eq:9} and Equation~\eqref{eq:16}, $\widetilde{\mathcal{M}}(L_{i-1},L_{i})$ in BGSMI could be approximated as
\begin{equation}\label{eq:17}
	\begin{aligned}
		\widetilde{\mathcal{M}}(L_{i-1},L_{i})=-\frac{1}{Ns(Sp(L_{i}))}\sum_{Sp(L_{i})}{\log_{2}P(L_{i}|L_{i-1})}\\
		-\frac{1}{Ns(Sp(L_{i-1}))}\sum_{Sp(L_{i-1})}{\log_{2}P(L_{i-1}|L_{i-2})}-\max\{\cdot\}.
	\end{aligned}
\end{equation}

Thus, $\widetilde{\mathcal{M}}(L_{i-1},L_{i})$ could be approximated by the CPFs in BNCV directly.
The procedure of the approximation of Equation~\eqref{eq:4} is given in Algorithm~\ref{al:2}, 
whose complexity is $O(n|W_{B}|)$.


\begin{algorithm}[htbp]
	\caption{Approximation of MI-based Correlation}
	\raggedright
	\hspace*{\algorithmicindent} \textbf{Input:} $I$\\
	\raggedright
	\hspace*{\algorithmicindent} \textbf{Parameters:}  $W_{B}$, the weights in $B$\\
	\raggedright
	\hspace*{\algorithmicindent} \textbf{Output:} $\mathcal{M}(L)$
	\begin{algorithmic}[1]\label{al:2}
		\STATE Let $t$ be an empty list with equal size to $W_{B}$
		\STATE $i\gets 2$
		\WHILE{$i\leq |t|$}
		\STATE Sample $P(L_{i}|L_{i-1})$ by Equation~\eqref{eq:3}
		\STATE Calculate $\widetilde{\mathcal{M}}(L_{i-1},L_{i})$ by Equation~\eqref{eq:17}
		\STATE $t[i]\gets \widetilde{\mathcal{M}}(L_{i-1},L_{i})$
		\STATE $i\gets i+1$
		\ENDWHILE
		%\STATE // By Equation~\eqref{eq:4}
		\RETURN $\mathcal{M}(L)\gets \frac{1}{|t|}\sum_{i=2}^{n}\widetilde{\mathcal{M}}(L_{i-1},L_{i})$  // Approximation of Equation~\eqref{eq:4}
	\end{algorithmic}
\end{algorithm}

\subsection{Training algorithm}
Following $(L_{1},L_{2}),\dots,(L_{n-1},L_{n})$, the correlation between neighbor layers is incorporated into BGS by calculating the correlation of the whole framework.
To intensify the correlation between neighbor layers in BGS, we maximize Equation~\eqref{eq:4} in the loss function of BGSMI, since the larger the correlation, the less the uncertainty between neighbor layers.
Thus, the loss function of BGSMI is formulated as
\begin{equation}\label{eq:5}
	\mathcal{L}=\mathcal{L}_{B}+\ln{\mathcal{M}(L)},
\end{equation}
where $\mathcal{L}_{B}$ is the loss function of BGS \citep{Hasanzadeh}, formulated as
\begin{equation}
	\begin{aligned}\label{eq:loss}
		\mathcal{L}_{B}&=\mathbb{E}_{q(L,B)}{\ln P(Y|F,L,B)}\\
		&-KL(q(L, B)||p(L,B))
		+\xi\sum_{i}^{|F|}{I^{2}_{i}},
	\end{aligned}
\end{equation}
where $q(\cdot)$ and $p(\cdot)$ are the distributions of random variables, $KL(\cdot)$ is the Kullback-Leibler (KL) divergence \citep{Handson}, $I_{i}\in I$, and $\xi$ is a constant.


To avoid under-fitting, we remove $\xi\sum_{i}^{|F|}{I^{2}_{i}}$ from Equation~\eqref{eq:loss}, and thus rebuild the loss function as
\begin{equation}
	\begin{aligned}\label{eq:reloss}
		\mathcal{L}&=\mathbb{E}_{q(L,B)}{\ln P(Y|F,L,B)}\\
		&-KL(q(L,B)||p(L,B))+\ln{\mathcal{M}(L)}.\\
	\end{aligned}
\end{equation}

Therefore, BGSMI could be trained by minimizing the loss function in Equation~\eqref{eq:reloss} via gradient descent.
The above idea of BGSMI training is summarized in Algorithm~\ref{al:1}, whose 
complexity is $O(n|W_{B}|+|W_{L}|T)$.

\begin{algorithm}[htbp]
	\caption{Training BGSMI}
	\raggedright
	\hspace*{\algorithmicindent} \textbf{Input:} $I$\\
	\raggedright
	\hspace*{\algorithmicindent} \textbf{Parameters:} $T$, the number of epochs; $lr$, the learning rate\\
	\raggedright
	\hspace*{\algorithmicindent} \textbf{Output:} $W_{L}$, the weights in $L$; $W_{B}$, the weights in $B$ \\
	\begin{algorithmic}[1]\label{al:1}
		\STATE $i \gets 0$
		\STATE Randomly initialize $W_{L}^{i}$ and $W_{B}^{i}$
		\WHILE {$i<T$}
		\STATE Generate $V$ and $E$ from BGS to build $G^{d}(V,E)$
		\STATE Calculate $P^{c}$ based on $G^{d}(V,E)$, $W_{L}^{i}$ and $W_{B}^{i}$
		\STATE Calculate $\mathcal{L}$ by $W_{L}^{i}$, $W_{B}^{i}$ and $P^{c}$ // By Equation~\eqref{eq:reloss}
		\STATE $(\nabla \mathcal{L}_{L},\nabla \mathcal{L}_{B})\gets \nabla \mathcal{L}$
		\STATE $W_{L}^{i+1}\gets (W_{L}^{i}-lr*\nabla \mathcal{L}_{L})$
		\STATE $W_{B}^{i+1}\gets (W_{B}^{i}-lr*\nabla \mathcal{L}_{B})$
		\STATE $i\gets i+1$
		\ENDWHILE
		\RETURN $W_{L}^{T}$, $W_{B}^{T}$
	\end{algorithmic}
\end{algorithm}




\section{Experiments}
We evaluate BGSMI to answer the following questions:

\emph{\textbf{Q}}\textbf{1:} How does BGSMI perform in terms of the accuracy compared with other 
state-of-the-art models on limited training data?

\emph{\textbf{Q}}\textbf{2:} How does the MI-based correlation in BGSMI alleviate over-fitting 
and over-smoothing in few-shot learning?

\emph{\textbf{Q}}\textbf{3:} How does noise impact the accuracy of BGSMI based few-shot learning?


\subsection{Experiment settings}


\textbf{Datasets.} We used two citation network benchmarks for graph node classification 
\citep{KipfandWelling}, shown in Table~\ref{tb:dataset}. Cora records a citation network of publications%
\footnote{\url{https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz}}
and Citeseer records the citation information of papers released by Citeseer%
\footnote{\url{https://csxstatic.ist.psu.edu/downloads/data}}.

\begin{table}[htbp]
	\centering
	\caption{Statistics of datasets.}
	{\small
		\begin{tabular}{ccccc}
			\toprule
			\textbf{Dataset}  & \textbf{Nodes} &\textbf{Edges} &\textbf{Features per node}&\textbf{Classes}  \\
			\midrule
			Cora & 2,708 & 5,429 & 1,433 & 7 \\
			Citeseer & 3,327  & 4,732 & 3,703 &  6 \\
			\bottomrule
	\end{tabular}}
	\label{tb:dataset}
\end{table}


\textbf{Comparison methods.}
We carefully chose seven state-of-the-art methods as competitors for BGSMI:
\begin{itemize}
	\item \textbf{GCN} (graph convolutional network) \citep{KipfandWelling} uses the graph convolution 
	layer with Laplace transform to handle the graphical structure.
	
	\item \textbf{GAT} (graph attention network) \citep{Velic} uses attention to simulate Laplace 
	transform and operates on graph-structured data, leveraging masked self-attentional 
	layers based on graph convolutions or their approximations.
	
	\item \textbf{CHEB} \citep{Defferrard} is a convolutional neural network on graphs that applies 
	fast localized spectral filtering to improve the GCN without considering the entire graph.
	
	\item \textbf{GMI} (graphical mutual information with standard structure) \citep{Peng2} measures 
	the correlation between input graphs and high-level hidden representations, and generalizes 
	conventional mutual information computations, concerning node features and graphical structure.
	
	\item \textbf{BGS} (Bayesian GCN with adaptive connection sampling) \citep{Hasanzadeh} is an 
	efficient version of BGNN.
	
	\item \textbf{MAML} (model-agnostic meta-learning) \citep{MAML} is a classic meta-learning 
	algorithm to learn the network parameters in deep learning models using a two-step strategy. 
	
	\item \textbf{RALE} (relative and absolute location embedding) \citep{RALE} aligns different 
	tasks toward learning a transferable prior by using the relative and absolute location embedding 
	to solve over-fitting in few-shot node classification on graphs.
	
\end{itemize}




\begin{table*}[htbp]
	\centering
	\caption{Accuracy with different sized training sets.}
	{\small
		\begin{tabular}{c c c c c c c}
			\toprule
			\multirow{2}{*}{\textbf{Method}} &
			\multicolumn{3}{c}{\textbf{Cora}} & \multicolumn{3}{c}{\textbf{Citeseer}}\\
			\cline{2-7}
			& \centering 1-way 3-shot & 1-way 5-shot & 3-way 5-shot &  1-way 3-shot & 1-way 5-shot & 3-way 5-shot\\
			\midrule
			GCN  & $0.542$ & $0.673$ &	$0.797$	& $0.543$	& $0.634$ &	$0.732$\\
			GAT  & $0.541$ & $0.592$ &	$0.762$	& $0.385$	& $0.672$ &	$0.823$\\
			CHEB  & $0.651$ & $0.753$ &	$0.801$	& $0.563$	& $0.752$ &	$0.781$\\
			GMI & $0.618$ & $0.689$ & $0.763$ & $ 0.502$ & $0.722$ & $0.834$\\
			BGS	& $0.308$ & $0.703$ &	$0.679$	& $0.502$	& $0.635$	& $0.665$\\
			MAML & $0.570$ & $0.660$ &	$0.591$	& $0.546$	& $0.618$ &	$0.707$\\
			RALE  & $ 0.752$ & $0.858$ &	$0.888$	& $0.656$	& $0.792$ &	$0.813$\\
			BGSMI &	$\textbf{0.770}$ &	$\textbf{0.859}$ &	$\textbf{0.890}$ &	$\textbf{0.664}$ & $\textbf{0.814}$ & $\textbf{0.855}$	\\
			\bottomrule
	\end{tabular}}
	\label{tb:acc_size}
\end{table*}


\textbf{Implementations.}
We considered the following two experimental variables: depth of the layers (2, 4, and 6 layers), 
and inclusion of noise (adding noise that follows a Gaussian distribution to the training set).
For the Cora dataset, we used 140 nodes, 500 nodes and 1000 nodes as the training set, validation 
set and test set, respectively. For the Citeseer dataset, we used 120 nodes as the training set, 
500 nodes and 1000 nodes as the validation set and test set, respectively. Models of GCN, GAT and 
CHEB use the pre-training strategy.
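
One plausible reading of the noise setting, stated here as an assumption because the exact injection scheme is not specified above, is an additive zero-mean Gaussian perturbation of the node features. With $\mathbf{x}_v$ denoting the feature vector of a training node $v$ and $\sigma$ a hypothetical noise scale (both symbols are introduced only for this sketch), this reads
\begin{equation*}
	\tilde{\mathbf{x}}_v = \mathbf{x}_v + \boldsymbol{\epsilon}_v, \qquad \boldsymbol{\epsilon}_v \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}),
\end{equation*}
where $\sigma = 1$ would correspond to standard normal noise.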

\textbf{Metric.}
We used accuracy to evaluate the effectiveness of BGSMI-based few-shot learning, defined as the 
ratio of the number of correct predictions by the neural network to the total number of samples.
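
For concreteness, the metric can be written as follows. The symbols are introduced only for this illustration: $N$ is the number of evaluated samples, $y_i$ and $\hat{y}_i$ are the true and predicted labels of the $i$-th sample, and $\mathbb{1}[\cdot]$ is the indicator function.
\begin{equation*}
	\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]
\end{equation*}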

\textbf{Hyperparameters.}
We considered four hyperparameters: the learning rate, the decay rate of L2 regularization, 
the number of epochs and the kernel sizes of the graph convolutions. 
We fixed the learning rate to 0.001 for GMI and 0.005 for the other models, 
the decay rate of L2 regularization to $5 \times 10^{-3}$, and the number of epochs to 400.
The kernel sizes of the graph convolutions in the comparison methods are the square of the number 
of features per node, 128 times the number of features per node, 
512 times the number of features per node, $512 \times 512$ and $128 \times 128$. The kernel 
sizes of the graph convolutions in BGSMI are 256 times the number of features per node, 
$256 \times 256$, $256 \times 128$, $128 \times 64$ and $64 \times 64$.
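
The role of the L2 decay rate can be made explicit with a generic regularized objective. This is a standard form given up to the constant-factor convention of the optimizer, and not necessarily the exact loss used in the implementation. The symbols $\mathcal{L}_{\text{task}}$ (the classification loss), $\theta$ (the trainable parameters) and $\lambda$ (the decay rate) are assumed for this sketch only.
\begin{equation*}
	\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \lambda \lVert \theta \rVert_2^2, \qquad \lambda = 5 \times 10^{-3}
\end{equation*}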

\textbf{Environment.}
Our experiments were run on a machine with an Intel i9 3.6\,GHz CPU, 128\,GB RAM and RTX3090 GPUs.
All code was written in PyTorch.

\subsection{Experimental results}


\begin{table*}[htbp]
	\centering
	\caption{Accuracy with different depths.}
	{\small
		\begin{tabular}{ccccccc}
			\toprule
			\multirow{2}{*}{\textbf{Method}} & \multicolumn{3}{c}{\textbf{Cora}} & \multicolumn{3}{c}{\textbf{Citeseer}}\\
			\cmidrule(lr){2-4} \cmidrule(lr){5-7}
			& $2$ layers & $4$ layers & $6$ layers & $2$ layers & $4$ layers & $6$ layers\\
			\midrule
			%GCN & $0.593$ & $0.632$ &	$ 0.621$	& $0.553$	& $0.608$ &	$0.608$\\
			GCN  & $0.631$ & $0.683$ &	$0.693$	& $0.603$	& $0.636$ &	$0.643$\\
			%GAT	& $0.586$ &	$0.678$ &	$0.651$	& $0.543$	& $0.620$ &	$0.647$\\
			GAT  & $0.624$ & $0.689$ &	$0.670$	& $0.703$	& $0.733$ &	$0.711$\\
			%CHEB & $0.645$ & $0.692$ & $0.682$ &	$0.612$ &	$0.650$ &	$0.644$\\
			CHEB  & $0.635$ & $0.701$ &	$0.697$	& $0.637$	& $0.694$ &	$0.683$\\
			GMI & $0.627$ & $0.696$ & $0.723$ & $ 0.633$ & $0.702$ & $0.713$\\
			BGS	& $0.503$ & $0.563$ &	$0.544$	& $0.551$	& $0.600$	& $0.561$\\
			MAML  & $0.611$ & $0.653$ &	$0.662$	& $0.596$	& $0.646$ &	$0.624$\\
			RALE  & $ 0.733$ & $0.822$ &	$0.833$	& $0.722$	& $0.761$ &	$0.784$\\
			BGSMI &	$\textbf{0.746}$ &	$\textbf{0.887}$ &	$\textbf{0.847}$ &	$\textbf{0.738}$ & $\textbf{0.772}$ & $\textbf{0.793}$	\\
			\bottomrule
	\end{tabular}}
	\label{tb:acc_depth}
\end{table*}


\textbf{Accuracy of BGSMI-based few-shot learning.}
We compared the accuracy of BGSMI-based few-shot learning with that of the comparison methods 
under various sizes of the support and query sets, as reported in Table~\ref{tb:acc_size}, 
where `$n$-way $k$-shot' means that $n$ and $k$ samples are included in the support set and query set in each batch, respectively. 

We find that: (a) On Cora, BGSMI achieves the highest accuracy under the 1-way 3-shot, 1-way 5-shot 
and 3-way 5-shot settings, with the highest average accuracy of 84.0\%. On Citeseer, BGSMI also achieves the highest accuracy under all three settings, with the highest average accuracy of 77.8\%. 
(b) Relative to RALE, which is the strongest comparison method in every setting except 3-way 5-shot on Citeseer (where GMI reaches 0.834), BGSMI improves the accuracy by about 2.4\%, 0.1\% and 0.2\% on Cora and by about 1.2\%, 2.8\% and 5.2\% on Citeseer under the 1-way 3-shot, 1-way 5-shot and 3-way 5-shot settings, respectively. These results verify the effectiveness of BGSMI in improving the accuracy of few-shot learning.
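
The averages and relative gains quoted above can be checked directly against Table~\ref{tb:acc_size}. In the worked example below, $\bar{a}$ denotes the average accuracy of BGSMI over the three settings and $g$ the relative gain over the strongest comparison method in the corresponding column (RALE in both cases shown); both symbols are introduced only for this illustration.
\begin{align*}
	\bar{a}_{\text{Cora}} &= \tfrac{1}{3}(0.770 + 0.859 + 0.890) \approx 0.840, \\
	\bar{a}_{\text{Citeseer}} &= \tfrac{1}{3}(0.664 + 0.814 + 0.855) \approx 0.778, \\
	g_{\text{Cora, 1-way 3-shot}} &= \frac{0.770 - 0.752}{0.752} \approx 2.4\%, \\
	g_{\text{Citeseer, 1-way 5-shot}} &= \frac{0.814 - 0.792}{0.792} \approx 2.8\%.
\end{align*}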


\textbf{Alleviation of over-fitting and over-smoothing.}
To test how MI alleviates over-fitting and over-smoothing, we compared the accuracy of few-shot 
learning based on BGSMI with that of the comparison methods at different depths of BGS, as reported 
in Table~\ref{tb:acc_depth}. 

We find that: (a) On Cora, BGSMI achieves the highest average accuracy over the three depths 
of BGS, 82.7\%. On Citeseer, BGSMI also achieves the highest average accuracy, 76.8\%. 
(b) With 6 layers, BGSMI improves the accuracy by 1.7\% on Cora and 1.1\% on Citeseer 
relative to the best comparison method (RALE). 
(c) Across Cora and Citeseer, the average rate of change in accuracy from 2 layers 
to 6 layers is 1.1\% for BGSMI, compared with 34.9\% for the comparison methods, which shows that 
the accuracy of BGSMI remains stable as the depth of BGS increases. This suggests that the correlation 
in BGSMI indeed alleviates over-smoothing and over-fitting in few-shot learning.
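
As a worked check of the figures above against Table~\ref{tb:acc_depth}, write $\bar{a}$ for the average accuracy of BGSMI over the three depths and $g$ for its relative gain over the best comparison method at 6 layers (RALE on both datasets); both symbols are introduced only for this illustration.
\begin{align*}
	\bar{a}_{\text{Cora}} &= \tfrac{1}{3}(0.746 + 0.887 + 0.847) \approx 0.827, \\
	\bar{a}_{\text{Citeseer}} &= \tfrac{1}{3}(0.738 + 0.772 + 0.793) \approx 0.768, \\
	g_{\text{Cora}} &= \frac{0.847 - 0.833}{0.833} \approx 1.7\%, \\
	g_{\text{Citeseer}} &= \frac{0.793 - 0.784}{0.784} \approx 1.1\%.
\end{align*}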


\begin{figure}[htbp]
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_size_cora.pdf}
		\caption{Cora (different depths)}
		\label{fig:coraend}
	\end{subfigure}%
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_size_citeseer.pdf}
		\caption{Citeseer (different depths)}
		\label{fig:citeend}
	\end{subfigure}
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_size_noise_cora.pdf}
		\caption{Cora (with/without noise)}
		\label{fig:coranoend}
	\end{subfigure}%
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_size_noise_citeseer.pdf}
		\caption{Citeseer (with/without noise)}
		\label{fig:citenoend}
	\end{subfigure}
	\caption{Impacts of training size on accuracy of BGSMI.}
	\label{fig:acc_size}
\end{figure}

\begin{figure}[htbp]
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_depth_cora.pdf}
		\caption{Cora (different sized training sets)}
		\label{fig:depth_cora}
	\end{subfigure}%
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_depth_citeseer.pdf}
		\caption{Citeseer (different sized training sets)}
		\label{fig:depth_citeseer}
	\end{subfigure}
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_depth_noise_cora.pdf}
		\caption{Cora (with/without noise)}
		\label{fig:depth_noise_cora}
	\end{subfigure}%
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{acc_depth_noise_citeseer.pdf}
		\caption{Citeseer (with/without noise)}
		\label{fig:depth_noise_citeseer}
	\end{subfigure}
	\caption{Impacts of depths on accuracy of BGSMI.}
	\label{fig:acc_depth}
\end{figure}






\textbf{Impacts of parameters.} To evaluate the impacts of the experimental variables, we recorded the accuracy of BGSMI-based few-shot learning as the training size increases under different parameters, as reported in Figure~\ref{fig:acc_size}.
We also recorded the accuracy as the depth increases under different parameters, as reported in Figure~\ref{fig:acc_depth}.

From Figure~\ref{fig:acc_size}, we can see that, on both Cora and Citeseer, the accuracy increases with the training size at every depth,
and the accuracy does not decrease sharply when noise is added, regardless of the training size. %, which shows that our method is robust to noise to a certain extent

From Figure~\ref{fig:acc_depth}, we can see that: (a) the accuracy on Cora stays high and the accuracy on Citeseer remains stable, although the accuracy decreases as the depth of BGS increases. (b) On both Cora and Citeseer, the accuracy remains stable as the depth increases after noise is added. These results show that BGSMI is robust to noise to a certain extent, and that BGSMI-based few-shot learning can achieve high accuracy with limited training data at different depths of BGS.



\textbf{Impacts of noise.} Since MI may intensify the impact of noise on accuracy, we evaluated the accuracy of BGSMI-based few-shot learning with added Gaussian noise, as reported in Figure~\ref{fig:noise}.
The average accuracy on Cora remains at 60\% with noise, which is very close to the minimum accuracy without noise.
On Citeseer, the difference between the average accuracy with noise and the quarter-point accuracy without noise is less than 0.1.
This shows that the accuracy of BGSMI does not decrease sharply when Gaussian noise is added.
\begin{figure}[htbp]
	\centering
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{noise_cora.pdf}
		\caption{Cora}
		\label{fig:noise_cora}
	\end{subfigure}%
	\begin{subfigure}{0.25\textwidth}
		\centering
		\includegraphics[width=0.9\textwidth]{noise_citeseer.pdf}
		\caption{Citeseer}
		\label{fig:noise_citeseer}
	\end{subfigure}
	\caption{Impacts of noise on accuracy of BGSMI.}
	\label{fig:noise}
\end{figure}


\textbf{Summary.} Across the above experiments with different experimental variables and datasets, we find that incorporating the MI-based correlation into BGNN provides more useful features and indeed improves the accuracy of BGSMI-based few-shot learning.
\begin{itemize}
	\item By incorporating the MI-based correlation, BGSMI outperforms the other state-of-the-art methods.
	Specifically, with limited training data, BGSMI improves the accuracy over BGS by 67.8\% on Cora and 29.7\% on Citeseer, averaged over the relative gains in the three settings of Table~\ref{tb:acc_size}. 
	Moreover, BGSMI also achieves the highest accuracy on Cora and Citeseer at different depths of BGS.
	
	\item MI can be approximated efficiently, and BGSMI is insensitive to noise.
	Specifically, the difference between the average accuracy of BGSMI with noise and that without noise is less than 0.2 on both datasets.
\end{itemize}
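
The average improvement over BGS can be reproduced from Table~\ref{tb:acc_size} as the mean of the per-setting relative gains $(\text{BGSMI} - \text{BGS})/\text{BGS}$, taken in the order 1-way 3-shot, 1-way 5-shot and 3-way 5-shot. This reading is a reconstruction from the table entries; for example, the first Cora term is $(0.770 - 0.308)/0.308 = 150.0\%$.
\begin{align*}
	\text{Cora:}\quad & \tfrac{1}{3}\left(150.0\% + 22.2\% + 31.1\%\right) \approx 67.8\%, \\
	\text{Citeseer:}\quad & \tfrac{1}{3}\left(32.3\% + 28.2\% + 28.6\%\right) \approx 29.7\%.
\end{align*}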



\section{Conclusion and future work}
In view of the insufficient features in limited training data, we propose BGSMI, a framework that leverages the feature correlation described by MI to improve the accuracy of BGNN-based few-shot learning.
Our proposed framework not only achieves high accuracy, but also alleviates over-smoothing and over-fitting with limited training data in few-shot learning tasks.
Experimental results show that the accuracy of BGSMI does not degrade sharply under noise.

As future work, we will study how to enhance our framework to eliminate the influence of noise as much as possible.
For better interpretability of the combination of deep neural networks and BN, we will consider incorporating ideas from Bayesian deep learning models to further integrate BNCV and BGS.



\begin{acknowledgements} 
This work was supported by the National Natural Science Foundation of China (U1802271, 62002311), Program of Key Lab of Intelligent Systems and Computing of Yunnan Province (202205AG070003), Science Foundation for Distinguished Young Scholars of Yunnan Province (2019FJ011) and Major Project of Science and Technology of Yunnan Province (202202AD080001).
\end{acknowledgements}

\bibliography{song_513}


\end{document}
