\documentclass[accepted]{uai2024} % for initial submission
%\documentclass[accepted]{uai2024} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2024} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2024} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example




%==========
\usepackage{graphicx}
\usepackage[ruled,linesnumbered,vlined]{algorithm2e}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{multirow}
\usepackage{adjustbox}
\newcommand*\tcircle[1]{%
  \raisebox{-0.5pt}{%
    \textcircled{\fontsize{7pt}{0}\fontfamily{phv}\selectfont #1}%
  }%
}
\usepackage{makecell}
\usepackage{amsthm}
\usepackage{amssymb}
\usepackage{hyperref}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% THEOREMS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\theoremstyle{plain}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{corollary}[theorem]{Corollary}
%\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{assumption}[theorem]{Assumption}
%\theoremstyle{remark}
\newtheorem{remark}[theorem]{Remark}
%==========
\title{Multi-Relational Structural Entropy}

% The standard author block has changed for UAI 2024 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<ycao43@uic.edu>?Subject=Your UAI 2024 paper}{Yuwei Cao}{}}
\author[2]{Hao Peng \thanks{Corresponding author}}
%\blfootnote{\textsuperscript{*}This is the corresponding author.}
%\authornote{This is the corresponding author.}
\author[2]{Angsheng Li}
\author[3]{Chenyu You}
\author[4]{Zhifeng Hao}
\author[1]{Philip S. Yu}
% Add affiliations after the authors
\affil[1]{%
    University of Illinois Chicago\\
    Chicago, USA
}
\affil[2]{%
    Beihang University\\
    Beijing, China
} 
\affil[3]{%
    Yale University\\
    New Haven, USA
}
\affil[4]{%
    Shantou University\\
    Shantou, China
}
%\affil[*]{Corresponding author}
  \begin{document}
\maketitle

\begin{abstract}
  Structural Entropy (SE) measures the structural information contained in a graph. Minimizing or maximizing SE helps to reveal or obscure the intrinsic structural patterns underlying graphs in an interpretable manner, finding applications in various tasks driven by networked data. However, SE ignores the heterogeneity inherent in the graph relations, which is ubiquitous in modern networks. In this work, we extend SE to consider heterogeneous relations and propose the first metric for multi-relational graph structural information, namely, multi-relational structural entropy (MrSE). 
  To this end, we first cast SE through the novel lens of the stationary distribution from random surfing, which readily extends to multi-relational networks by considering the choices of both nodes and relation types simultaneously at each step. The resulting MrSE is then optimized by a new greedy algorithm to reveal the essential structures within a multi-relational network.
  Experimental results highlight that the proposed MrSE offers a more insightful interpretation of the structure of multi-relational graphs compared to SE. Additionally, it enhances the performance of two tasks that involve real-world multi-relational graphs, including node clustering and social event detection.
  %We first show that SE can be interpreted using the stationary probability distribution vector obtained from random surfing. Subsequently, we introduce MrSE derived from random surfing on multi-relational networks, considering the choices of nodes and relation types simultaneously at each step. We illustrate how to decode the essential structures within multi-relational networks using our proposed MrSE. 
\end{abstract}

\section{Introduction}\label{sec:intro}
%\blfootnote{\textsuperscript{*}This is the corresponding author.}
\begin{figure}[ht]
\begin{center}
\centerline{\includegraphics[width=0.9\columnwidth]{./image/overall.pdf}}
\caption{Decoding the essential structures of a multi-relational graph with MrSE (ours) and SE. (a) A multi-relational graph $G'$. (b) The essential structures of $G'$, decoded with MrSE. (c) A single-relational graph $G$ reduced from $G'$. (d) The essential structures of $G$, decoded with SE.}
\label{figure:SE_vs_MrSE}
\end{center}
\end{figure}

In recent decades, graphs have become ubiquitous: social networks, recommendation networks, and publication networks are all naturally represented as graphs. Structural entropy (SE) \citep{li2016structural}, which measures the amount of structural information contained in a graph, provides a useful tool for graph analysis. Specifically, unlike various graph measures \citep{raychaudhury1984discrimination, braunstein2006laplacian, dehmer2008information, bianconi2009entropy} that are based on unstructured probability distributions, SE is interpretable \citep{liu2021bridging}. Minimizing or maximizing SE helps to disclose or obfuscate the essential structures underlying the raw, noisy graphs. Such favorable properties of SE have led to its recent applications in tasks including graph pooling \citep{wu2022structural}, community structure deception \citep{liu2019rem}, graph contrastive learning \citep{wu2023sega}, graph similarity measurement \citep{liu2021bridging}, graph structure learning \citep{zou2023se}, network design \citep{liu2021bridging}, and social event detection \citep{cao2023hierarchical}.

However, SE assumes the existence of only a single type of relation between nodes, while in reality, graphs are multiplex \citep{de2013mathematical} in nature, incorporating heterogeneous relation types. For example, there may be multiple edges between two papers in a publication network, indicating shared authors, keywords, citations, accepted conferences, etc. As shown in Figures \ref{figure:SE_vs_MrSE}(a) and \ref{figure:SE_vs_MrSE}(c), analyzing multi-relational graphs with SE requires preprocessing them into single-relational graphs, which leads to information loss. This is due to the complementary nature of various relation types in revealing the graph's structure, with some relations being more informative than others. For instance, to determine if one paper is a follow-up study of another, examining the citations in addition to the keywords proves beneficial, while relying on accepted conferences may not provide as much insight. Therefore, it is essential to extend SE to consider multiple relation types.


%simultaneously consider all relation types 

In this work, we propose the first metric for multi-relational graph structural information, namely, multi-relational structural entropy (MrSE). Specifically, the original definition of SE measures the minimum number of bits required to determine the code of the node that is accessible with \textit{one step of random walk} on a single-relational graph $G$ \citep{li2016structural}, calculated from node degree statistics. Inspired by this, we propose to interpret SE with the stationary distribution vector obtained through \textit{random surfing} \citep{page1999pagerank}, i.e., taking an infinitely long random walk, on $G$. Continuing with this idea, we then introduce the definition of MrSE, incorporating random surfing on a multi-relational graph $G'$. During this multi-relational random surfing process, we simultaneously consider the choices of node and relation type at each step. The resulting stationary distribution vectors, one for nodes and one for relations, are used for MrSE calculation. We further illustrate how our proposed MrSE metric can be used to decode essential structures, such as communities, within $G'$. Through experiments on synthetic graphs, we demonstrate that our proposed MrSE outperforms SE in interpreting the structure of multi-relational graphs. Additionally, MrSE improves the performance of two real-world multi-relational graph tasks, namely node clustering and social event detection.


Our paper makes the following contributions:
\begin{itemize}
\item We introduce MrSE, the first metric designed to quantify the structural information within multi-relational graphs. Extending the favorable properties of SE, MrSE addresses heterogeneous relation types and serves as an improved tool for measuring and interpreting complex multi-relational graph structures.
\item We demonstrate how our proposed MrSE metric can decipher structures within multi-relational graphs. Introducing an algorithm for 2-dimensional (2D) MrSE minimization, we enable the detection of communities within multi-relational graphs.
%Introducing a greedy algorithm for 2-dimensional (2D) MrSE minimization, we enable the detection of communities within multi-relational graphs. Additionally, hierarchical graph partitioning can be integrated to expedite the algorithm.
\item Experiments on synthetic graphs with varying total numbers of relations, sizes, and sparsities demonstrate that our proposed MrSE, in comparison to SE, offers a more insightful interpretation of the structure of multi-relational graphs. Notably, a greater reduction is observed when employing MrSE for graph entropy minimization, indicating a more effective decoding of structural information. Furthermore, experiments on real-world multi-relational graph data show that MrSE improves the performance of two tasks, namely node clustering and social event detection.
\end{itemize}

\section{Related Works and Background}
%\section{Background}
\label{sec:prelim}
\textbf{Entropy-based Graph Metrics.} 
Measuring graph complexity is an important issue in graph analysis. To tackle this, various entropy-based graph measures \citep{raychaudhury1984discrimination, braunstein2006laplacian, dehmer2008information, bianconi2009entropy, li2016structural} have been proposed. Each of these measures represents a distinct form of Shannon entropy designed for different types of distributions extracted from the graphs. For example, the von Neumann graph entropy \citep{braunstein2006laplacian} is defined as the Shannon entropy of the Laplacian spectra. In contrast to previous metrics, SE quantifies the Shannon entropy of degree statistics, providing interpretability from an algebraic perspective. However, all these metrics are designed for single-relational graphs. Hence, there is a pressing need for a metric that can assess the complexity of multi-relational graphs.

%entropy measure in terms of structural patterns     degree sequence   a measure of degree heterogeneity    interpretability

\textbf{Structural Entropy (SE).}
%Given a single-relational graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges, SE \cite{li2016structural} is defined as the minimum number of bits to encode the vertex that is accessible with one step of random walk on $G$. SE corresponds to an \textit{encoding tree} that encodes the structure of $G$. SE exists in various dimensions, each assessing the structure of different orders and aligning with encoding trees of distinct heights. The formal definition of SE is:
Let $G = (V, E)$ be a single-relational graph, where $V$ is a set of nodes and $E$ is a set of edges. Assuming that the structure of $G$ can be represented with an \textit{encoding tree}, the formal definitions of the encoding tree and SE are as follows:

\begin{definition}
\label{definition:encoding_tree}
\citep{li2016structural} An encoding tree $\mathcal{T}$ is a tree that encodes a hierarchical partition of $V$. $\mathcal{T}$ satisfies: 1) Each node $\alpha$ in $\mathcal{T}$ is associated with a node subset $T_\alpha \subseteq V$. In particular, the root node $\lambda$ of $\mathcal{T}$ is associated with $V$, and any leaf node $\gamma$ in $\mathcal{T}$ is associated with a single node in $V$. 2) The node subsets associated with the children of $\alpha$ form a partition of $T_\alpha$. 3) Denote the height of $\alpha$ as $h(\alpha)$. Let $h(\gamma) = 0$ and $h(\alpha^-) = h(\alpha) + 1$, where $\alpha^-$ is the parent node of $\alpha$. $h(\mathcal{T}) = \underset{\alpha \in \mathcal{T}}{\max}\{h(\alpha)\}$ is the height of $\mathcal{T}$. 
\end{definition}

\begin{definition}
\label{definition:SE_kD}
\citep{li2016structural} Given a single-relational graph $G$ and an encoding tree $\mathcal{T}$, the structural entropy (SE) of $G$ relative to $\mathcal{T}$ is
\begin{equation}
\label{eq:SE_kD}
\mathcal{H}^\mathcal{T}(G) = -\underset{\alpha\in\mathcal{T}, \alpha\neq\lambda}{\sum}\frac{g_\alpha}{\mathrm{vol}(T_\lambda)}\log\frac{\mathrm{vol}(T_\alpha)}{\mathrm{vol}(T_{\alpha^-})},
\end{equation}
where $g_\alpha$ is the summation of the weights of the cut edges of $T_\alpha$, i.e., edges in $E$ that have exactly one endpoint in $T_\alpha$. For a directed $G$, $g_\alpha$ is the summation of the weights of the edges that enter $T_\alpha$, i.e., edges that start in $V\backslash T_\alpha$ and end in $T_\alpha$. $\mathrm{vol}(\cdot)$ stands for the volume, i.e., the sum of the (in-)degrees, of the associated node subset. E.g., $\mathrm{vol}(T_\alpha)$, $\mathrm{vol}(T_{\alpha^-})$, and $\mathrm{vol}(T_\lambda)$ refer to the volume of $T_\alpha$, $T_{\alpha^-}$, and $V$, respectively.
\end{definition}

Note that an encoding tree $\mathcal{T}$ is essentially a description of a graph's structure.
%Note that an encoding tree $\mathcal{T}$ can be defined without considering the edges $E$ in a graph. 
Meanwhile, the SE values reveal how well $\mathcal{T}$ captures the structures of the graph. $\mathcal{H}^{(k)}(G)$, the $k$-dimensional SE of $G$, is defined as the $\mathcal{H}^\mathcal{T}(G)$ that is associated with a $\mathcal{T}$ satisfying $h(\mathcal{T}) = k$. For $k=1$, the 1-dimensional (1D) SE, $\mathcal{H}^{(1)}(G)$, is equivalent to the Shannon entropy of the degree heterogeneity. $\mathcal{H}^{(1)}(G)$ is associated with a unique $\mathcal{T}$ of height 1, measuring the intrinsic information within $G$ without making assumptions about higher-order structures, such as communities.
For $k>1$, minimizing or maximizing $\mathcal{H}^{(k)}(G)$ is equivalent to seeking a $\mathcal{T}$ of height $k$ that reveals or hides the $k$-dimensional structures within $G$. E.g., Figures \ref{figure:SE_vs_MrSE}(c) and \ref{figure:SE_vs_MrSE}(d) visualize how minimizing the 2-dimensional (2D) SE reveals the 2D structures, i.e., communities, within $G$. The resulting $\mathcal{T}$ is optimal, i.e., corresponds to the minimum 2D SE. $\mathcal{P} = \{\alpha|\alpha \in \mathcal{T}, h(\alpha) = 1\}$ forms a partition of $V$ that highlights the communities within $G$. We provide more examples of encoding trees and the 2D SE minimization process in Appendix \ref{app:example_SE}.
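
As a small worked illustration of Definition \ref{definition:SE_kD} (a toy example of our own, assuming base-2 logarithms and an unweighted, undirected graph), consider the path $1 - 2 - 3 - 4$. The degrees are $(1, 2, 2, 1)$, so $\mathrm{vol}(V) = 6$. In the unique encoding tree of height 1, every leaf has $g_\alpha$ equal to its degree, which gives
\begin{align*}
\mathcal{H}^{(1)}(G) &= -2\cdot\tfrac{1}{6}\log_2\tfrac{1}{6} - 2\cdot\tfrac{2}{6}\log_2\tfrac{2}{6} \\
&= \tfrac{1}{3} + \log_2 3 \approx 1.918.
\end{align*}
For the height-2 encoding tree whose height-1 nodes are $\{1, 2\}$ and $\{3, 4\}$, each community has volume $3$ and a single cut edge. The two community nodes therefore contribute $2\cdot\left(-\tfrac{1}{6}\log_2\tfrac{3}{6}\right) = \tfrac{1}{3}$, and the four leaves contribute $2\cdot\left(-\tfrac{1}{6}\log_2\tfrac{1}{3} - \tfrac{2}{6}\log_2\tfrac{2}{3}\right) = \log_2 3 - \tfrac{2}{3}$, so that
\begin{equation*}
\mathcal{H}^\mathcal{T}(G) = \log_2 3 - \tfrac{1}{3} \approx 1.252.
\end{equation*}
The value drops below the 1D SE, which is consistent with this partition capturing part of the structure of the path.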

SE has found various applications \citep{wu2022structural, wu2023sega, liu2021bridging, liu2019rem, li2016three, li2018decoding, cao2023hierarchical, peng2024unsupervised, zou2024multispans, zeng2024adversarial, zeng2023unsupervised, zeng2023effective, zeng2023hierarchical}. However, SE assumes the existence of only a single type of relation between nodes. This limitation calls for an improved SE measure that addresses the widespread heterogeneity of relation types.


%We denote the adjacency matrix of $G$ as $\textbf{A} \in \mathbb{R}_{+}^{|V|\times |V|}$, with entry $\textbf{A}_{i, j}$ equals the weight of the edge that starts from $j \in V$ and ends at $i \in V$. 
%For an undirected $G$, $\textbf{A}$ is symmetric. Otherwise, $\textbf{A}$ is asymmetric.

%1D SE provides an accurate, salable, and interpretable approximation of the von Neumann graph entropy. network design and graph similarity measure.


\section{Multi-relational Structural Entropy (MrSE)}
\label{sec:MrSE}
In this section, we first interpret SE from the perspective of random surfing (Section \ref{sec:RS_to_SE}). Following this intuition, we then draw inspiration from random surfing on multi-relational networks and derive the multi-relational structural entropy (MrSE) measure (Section \ref{sec:MRS_to_MrSE}). Finally, we show how to decode the structures within a multi-relational graph by minimizing the proposed MrSE (Section \ref{sec:decode_MrSE}). Appendix \ref{app:notations} summarizes the notations used in this work.

\subsection{A Random Surfing-based Interpretation of SE}
\label{sec:RS_to_SE}

As shown in Definition \ref{definition:SE_kD}, the original definition of SE is based on degree statistics. Meanwhile, we observe that the degree heterogeneity of $G$ resembles the stationary probability vector resulting from \textit{random surfing} \citep{page1999pagerank} on $G$. Leveraging this, we interpret Definition \ref{definition:SE_kD} through the lens of random surfing, as outlined below.

We denote the adjacency matrix of $G$ as $\textbf{A} \in \mathbb{R}_{+}^{|V|\times |V|}$, with entry $\textbf{A}_{j, i}$ equal to the weight of the edge that starts from node $i \in V$ and ends at node $j \in V$.
We assume that $G$ is strongly connected, i.e., $\textbf{A}$ is irreducible. A surfer randomly starts from node $i \in V$. At each step, the surfer randomly steps into a neighboring node in $\{j|\textbf{A}_{j, i} \not= 0\}$. The surfing process can thus be seen as a Markov chain with transition probability matrix $\tilde{\textbf{A}}$, where $\tilde{\textbf{A}}$ is obtained by column-normalizing $\textbf{A}$ such that $\forall i, \sum_{j=1}^{|V|}\tilde{\textbf{A}}_{j, i} = 1$ holds. 
Let $\textbf{x} \in \mathbb{R}_{+}^{|V|}$ be the stationary distribution, i.e., a probability distribution that indicates where the surfer is likely to be after an infinitely long walk. $\textbf{x}$ satisfies $\textbf{x} = \tilde{\textbf{A}}\textbf{x}$ and can be calculated using the Power Method \citep{journee2010generalized}.
With $\tilde{\textbf{A}}$ and $\textbf{x}$, Equation (\ref{eq:SE_kD}) can be rewritten as:
\begin{equation}
\label{eq:SE_kD_RS}
\mathcal{H}^\mathcal{T}(G) = -\underset{\alpha\in\mathcal{T}, \alpha\neq\lambda}{\sum}p_{\rightarrow\alpha}\log\frac{p_\alpha}{p_{\alpha^-}},
\end{equation}
where $p_{\rightarrow\alpha} = {\sum}_{i \in V\backslash T_{\alpha}}\textbf{x}_i{\sum}_{j \in T_{\alpha}}\tilde{\textbf{A}}_{j,i}$, $p_\alpha = {\sum}_{i \in T_{\alpha}}\textbf{x}_i$, and $p_{\alpha^-} = {\sum}_{i \in T_{\alpha^-}}\textbf{x}_i$. 

Particularly, the 1D SE of $G$ is rewritten as $\mathcal{H}^{(1)}(G) = -\overunderset{|V|}{i = 1}{\sum}\:\textbf{x}_i\log\textbf{x}_i$, which measures the intrinsic information contained in $G$.


\begin{proposition}
Equation (\ref{eq:SE_kD}) and Equation (\ref{eq:SE_kD_RS}) give the same definition of $\mathcal{H}^\mathcal{T}(G)$.
\end{proposition}
\begin{proof} 
For the first multiplicand on the RHS of Equation~(\ref{eq:SE_kD}), we have
\begin{equation}
\label{eq:derive_SE_RSSE_1}
\begin{split}
\frac{g_\alpha}{\mathrm{vol}(T_\lambda)} & = {\sum}_{i \in V\backslash T_{\alpha}}{\sum}_{j \in T_{\alpha}}P(i,j) \\
& = {\sum}_{i \in V\backslash T_{\alpha}}P(i){\sum}_{j \in T_{\alpha}}P(j|i) \\
& = {\sum}_{i \in V\backslash T_{\alpha}}\textbf{x}_i{\sum}_{j \in T_{\alpha}}\tilde{\textbf{A}}_{j,i} \\
& = p_{\rightarrow\alpha}.
\end{split}
\end{equation}
Additionally, for the second multiplicand on the RHS of Equation (\ref{eq:SE_kD}), we have
\begin{equation}
\begin{split}
\log\frac{\mathrm{vol}(T_\alpha)}{\mathrm{vol}(T_{\alpha^-})} & = \log\frac{\mathrm{vol}(T_\alpha)}{\mathrm{vol}(T_\lambda)} - \log\frac{\mathrm{vol}(T_{\alpha^-})}{\mathrm{vol}(T_\lambda)}\\
& = \log({\sum}_{i \in T_{\alpha}}\textbf{x}_i) - \log({\sum}_{i \in T_{\alpha^-}}\textbf{x}_i)\\
& = \log\frac{p_\alpha}{p_{\alpha^-}},
\end{split}
\end{equation}

which concludes the proof.
\end{proof}
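
The identifications $P(i) = \textbf{x}_i$ and $\mathrm{vol}(T_\alpha)/\mathrm{vol}(T_\lambda) = p_\alpha$ used in the proof can be checked directly in the undirected case. The following is a sketch under the assumption that $G$ is undirected, with $d_i = \sum_{j}\textbf{A}_{j,i}$ denoting the weighted degree of node $i$ (a notation introduced here only for this check). Setting $\textbf{x}_i = d_i/\mathrm{vol}(T_\lambda)$ and using the symmetry $\textbf{A}_{j,i} = \textbf{A}_{i,j}$ gives
\begin{align*}
(\tilde{\textbf{A}}\textbf{x})_j &= {\sum}_{i \in V}\frac{\textbf{A}_{j,i}}{d_i}\cdot\frac{d_i}{\mathrm{vol}(T_\lambda)} \\
&= \frac{d_j}{\mathrm{vol}(T_\lambda)} = \textbf{x}_j,
\end{align*}
so $\textbf{x}$ is stationary and, by irreducibility, it is the unique stationary distribution. Consequently, $p_\alpha = {\sum}_{i \in T_\alpha}\textbf{x}_i = \mathrm{vol}(T_\alpha)/\mathrm{vol}(T_\lambda)$. Likewise,
\begin{align*}
p_{\rightarrow\alpha} &= {\sum}_{i \in V\backslash T_{\alpha}}\frac{d_i}{\mathrm{vol}(T_\lambda)}{\sum}_{j \in T_{\alpha}}\frac{\textbf{A}_{j,i}}{d_i} \\
&= \frac{1}{\mathrm{vol}(T_\lambda)}{\sum}_{i \in V\backslash T_{\alpha}}{\sum}_{j \in T_{\alpha}}\textbf{A}_{j,i} = \frac{g_\alpha}{\mathrm{vol}(T_\lambda)},
\end{align*}
since the double sum collects exactly the weights of the edges between $T_\alpha$ and its complement.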

The assumption of strong connectivity for $G$ may be violated in certain situations. In such cases, \textit{stochasticity adjustment} \citep{page1999pagerank} is required to transform $G$ into a strongly connected graph.
Specifically, we replace all zero columns in $\tilde{\textbf{A}}$ with $\frac{1}{|V|}\textbf{e}$, where $\textbf{e}$ is a vector of ones. In addition, we make \textit{primitivity adjustment} \citep{page1999pagerank} to decrease the number of iterations needed for the Power Method to converge. Specifically, we replace $\tilde{\textbf{A}}$ with a new transition matrix $\textbf{B} = c\tilde{\textbf{A}} + (1-c)\textbf{E}$, where $(1-c)$ is the probability for the surfer to teleport to a random node and $\textbf{E}$ is the teleportation matrix. We set $\textbf{E}$ to $\frac{1}{|V|}\textbf{e}\textbf{e}^\top$ and $c$ to $0.85$ following \citet{page1999pagerank}. We note, nonetheless, that selecting $c$ requires balancing two demands: 1) $c$ should be small enough that the Power Method converges quickly, and 2) $c$ should be large enough that $G$ is not over-modulated and its intrinsic structural information is preserved. We leave the best strategy for choosing $c$ to future work.
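
To make the two adjustments concrete, the following sketch works through a toy directed graph of our own (the graph and the resulting numbers are illustrative assumptions only). Let $V = \{1, 2, 3\}$ with unit-weight edges $1 \to 2$, $2 \to 1$, and $2 \to 3$, so that node $3$ has no outgoing edge and the third column of $\tilde{\textbf{A}}$ is zero. After the stochasticity adjustment, and then the primitivity adjustment with $c = 0.85$, we obtain
\begin{align*}
\tilde{\textbf{A}} &= \begin{pmatrix} 0 & 1/2 & 1/3 \\ 1 & 0 & 1/3 \\ 0 & 1/2 & 1/3 \end{pmatrix}, \\
\textbf{B} &= \begin{pmatrix} 0.05 & 0.475 & 1/3 \\ 0.90 & 0.05 & 1/3 \\ 0.05 & 0.475 & 1/3 \end{pmatrix}.
\end{align*}
Every column of $\textbf{B}$ sums to one and every entry is positive, so the adjusted chain is irreducible and aperiodic. Solving $\textbf{x} = \textbf{B}\textbf{x}$ subject to ${\sum}_i\textbf{x}_i = 1$ gives $\textbf{x} = \left(\tfrac{57}{188}, \tfrac{74}{188}, \tfrac{57}{188}\right)^\top \approx (0.303, 0.394, 0.303)^\top$.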

%we introduce a damping factor $c$ such that at each step, the surfer follows the transition matrix with a probability of $c$ and teleport to a random node with probability $(1-c)$. The adjusted surfing process is referred to as random walk with restart (RWR) \cite{page1999pagerank}. Formally, RWR replaces the transition matrix $\tilde{\textbf{A}}$ with a new transition matrix $\textbf{B} = c\tilde{\textbf{A}} + (1-c)\textbf{E}$, where $\textbf{E}$ is the teleportation matrix. We set $\textbf{E}$ to $1/|V|\textbf{e}\textbf{e}^\top$ and set $c$ to $0.85$ following the well-known PageRank \cite{page1999pagerank}, while there are various other RWR algorithms \cite{gleich2006approximating, tong2008random, shin2015bear, wang2017fora, yoon2018tpa} available. We note, nonetheless, that the selection of $c$ requires balancing two demands: 1) $c$ is small enough so that the Power Method converges fast and 2) $c$ is reasonably large so that $G$ is not over-modulated and its intrinsic structural information is kept. We propose to explore the best strategy of choosing $c$ in the future.

\subsection{From Multi-relational Random Surfing to MrSE}
\label{sec:MRS_to_MrSE}
Following the intuitions in Section \ref{sec:RS_to_SE}, we derive the first metric for multi-relational graph structural information, i.e., multi-relational structural entropy (MrSE), based on random surfing on multi-relational networks. 

We denote a multi-relational network as $G' = (V, E', R)$, where $V$, $E'$, and $R$ stand for the node, edge, and relation sets of $G'$, respectively.
%Each edge in $E'$ is associated with a relation $r \in R$. 
The adjacency tensor of $G'$ is $\textbf{A}' \in \mathbb{R}_{+}^{|V|\times |V|\times|R|}$, with entry $\textbf{A}'_{i,j,r}$ equal to the weight of the edge that starts from $j \in V$, ends at $i \in V$, and is associated with relation $r \in R$. At each step of the multi-relational surfing, the surfer jointly considers which neighboring \textit{node} to visit and which \textit{relation} to use. We provide examples of $G'$, $\textbf{A}'$, and multi-relational random surfing in Appendix \ref{app:example_MrSE}. Inspired by \citet{ng2011multirank}, we use two transition matrices, denoted as $\mathcal{V}$ and $\mathcal{R}$, to model the choices of the neighboring node and the relation, respectively. We assume that $\textbf{A}'$ is irreducible \citep{ng2011multirank}, i.e., 
%for any fixed $r$, the $|V|\times|V|$ matrix $(\textbf{A}'_{i,j,r})$ is irreducible. 
for any fixed $r$, a slice of $\textbf{A}'$, $(\textbf{A}'_{i,j,r}) \in \mathbb{R}_{+}^{|V|\times |V|}$ is irreducible. 
$\mathcal{V}$ and $\mathcal{R}$ are constructed as $\mathcal{V}_{i,j,r} = \textbf{A}'_{i,j,r}/\overunderset{|V|}{i' = 1}\sum\textbf{A}'_{i',j,r}$ and $\mathcal{R}_{i,j,r} = \textbf{A}'_{i,j,r}/\overunderset{|R|}{r' = 1}\sum\textbf{A}'_{i,j,r'}$, respectively.
%\begin{equation}
%\mathcal{V}_{i,j,r} = \textbf{A}'_{i,j,r}/\overunderset{|V|}{i = 1}\sum\textbf{A}'_{i,j,r}\;\;\text{and}\;\;\mathcal{R}_{i,j,r} = \textbf{A}'_{i,j,r}/\overunderset{|R|}{r = 1}\sum\textbf{A}'_{i,j,r}.
%\end{equation}
Let $\textbf{x}' \in \mathbb{R}_{+}^{|V|}$ and $\textbf{y} \in \mathbb{R}_{+}^{|R|}$ be two probability distributions that tell us which node the surfer is likely to visit and which relation the surfer is likely to use at each step, respectively. After an infinitely long walk on $G'$, $\textbf{x}'$ and $\textbf{y}$ would converge to two stationary distributions that satisfy 
%$\mathcal{V}\textbf{x}'\textbf{y}=\textbf{x}'$ and $\mathcal{R}\textbf{x}'\textbf{x}'=\textbf{y}$, 
$\overunderset{|R|}{r = 1}\sum\overunderset{|V|}{j = 1}\sum\mathcal{V}_{i,j,r}\textbf{x}^{'}_{j}\textbf{y}_{r}=\textbf{x}^{'}_{i}$ and $\overunderset{|V|}{i = 1}\sum\overunderset{|V|}{j = 1}\sum\mathcal{R}_{i,j,r}\textbf{x}^{'}_{j}\textbf{x}^{'}_{i}=\textbf{y}_{r}$, 
respectively. $\textbf{x}'$ and $\textbf{y}$ can be calculated using the MultiRank algorithm \citep{ng2011multirank}.
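
As a toy illustration of these constructions (a sketch of our own; in particular, filling undefined entries of $\mathcal{R}$ uniformly is an assumption made here for concreteness), let $V = \{1, 2\}$ and $R = \{1, 2\}$ with $\textbf{A}'_{1,2,1} = \textbf{A}'_{2,1,1} = 1$, $\textbf{A}'_{1,2,2} = \textbf{A}'_{2,1,2} = 2$, and all other entries zero. Normalizing over the target node gives $\mathcal{V}_{1,2,r} = \mathcal{V}_{2,1,r} = 1$ and $\mathcal{V}_{1,1,r} = \mathcal{V}_{2,2,r} = 0$ for both relations. Normalizing over relations gives $\mathcal{R}_{1,2,1} = \mathcal{R}_{2,1,1} = 1/3$ and $\mathcal{R}_{1,2,2} = \mathcal{R}_{2,1,2} = 2/3$. The diagonal entries of $\mathcal{R}$ are of the form $0/0$, and we set them to $1/|R| = 1/2$. One can then check that $\textbf{x}' = (1/2, 1/2)^\top$ and $\textbf{y} = (5/12, 7/12)^\top$ satisfy both fixed-point equations, e.g.,
\begin{align*}
\textbf{y}_1 &= \tfrac{1}{4}\left(\tfrac{1}{2} + \tfrac{1}{3} + \tfrac{1}{3} + \tfrac{1}{2}\right) = \tfrac{5}{12}, \\
\textbf{x}'_1 &= 1\cdot\tfrac{1}{2}\cdot\tfrac{5}{12} + 1\cdot\tfrac{1}{2}\cdot\tfrac{7}{12} = \tfrac{1}{2}.
\end{align*}
The heavier relation receives the larger share of $\textbf{y}$, although the exact values depend on how the undefined entries are filled.
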
%\begin{equation}
%\label{eq:MRS}
%\mathcal{V}\textbf{x}'\textbf{y}=\textbf{x}'\;\;and\;\;\mathcal{R}\textbf{x}'\textbf{x}'=\textbf{y}.
%\end{equation}
With $\mathcal{V}$, $\textbf{x}'$, and $\textbf{y}$, we introduce the definition of MrSE as follows.

\begin{definition}
\label{definition:MrSE_kD_RS}
%Given an encoding tree $\mathcal{T}$, node transition matrix $\mathcal{V}$, node and relation stationary probability distributions $\textbf{x}'$ and $\textbf{y}$ resulted from multi-relational random surfing on $G'$, the MrSE of $G'$ relative to $\mathcal{T}$ is

Given a multi-relational graph $G'$ and an encoding tree $\mathcal{T}$, assume we have the node and relation stationary distributions $\textbf{x}'$ and $\textbf{y}$ acquired from multi-relational random surfing on $G'$ following node and relation transition matrices $\mathcal{V}$ and $\mathcal{R}$. The multi-relational structural entropy (MrSE) of $G'$ relative to $\mathcal{T}$ is
\begin{equation}
\label{eq:MrSE_kD_RS}
\mathcal{H}^\mathcal{T}(G') = -\underset{\alpha\in\mathcal{T}, \alpha\neq\lambda}{\sum}p'_{\rightarrow\alpha}\log\frac{p'_\alpha}{p'_{\alpha^-}},
\end{equation}
where $p'_{\rightarrow\alpha} 
= {\sum}_{i \in V\backslash T_{\alpha}}\textbf{x}'_i{\sum}_{j \in T_{\alpha}}{\sum}_{r \in R}{\mathcal{V}}_{j,i,r}\textbf{y}_r$, $p'_\alpha = {\sum}_{i \in T_{\alpha}}\textbf{x}'_i$, and $p'_{\alpha^-} = {\sum}_{i \in T_{\alpha^-}}\textbf{x}'_i$.
\end{definition}
Particularly, the 1D MrSE of $G'$, $\mathcal{H}^{(1)}(G') = -\overunderset{|V|}{i = 1}{\sum}\:\textbf{x}'_i\log\textbf{x}'_i$, measures the intrinsic information contained in $G'$.
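
As a sanity check on Definition \ref{definition:MrSE_kD_RS}, the following sketch considers the single-relation case. We assume $|R| = 1$ and that all entries of $\mathcal{R}$ are well defined, e.g., after filling undefined entries so that $\mathcal{R}$ sums to one over relations. Then $\mathcal{V}_{i,j,1} = \tilde{\textbf{A}}_{i,j}$ and $\mathcal{R}_{i,j,1} = 1$, and the two fixed-point equations become
\begin{align*}
\textbf{y}_1 &= {\sum}_{i \in V}{\sum}_{j \in V}\textbf{x}'_j\textbf{x}'_i = 1, \\
\textbf{x}'_i &= {\sum}_{j \in V}\tilde{\textbf{A}}_{i,j}\textbf{x}'_j,
\end{align*}
i.e., $\textbf{x}' = \tilde{\textbf{A}}\textbf{x}'$, so $\textbf{x}'$ coincides with the stationary distribution $\textbf{x}$ of Section \ref{sec:RS_to_SE}. Substituting into $p'_{\rightarrow\alpha}$ gives ${\sum}_{i \in V\backslash T_{\alpha}}\textbf{x}_i{\sum}_{j \in T_{\alpha}}\tilde{\textbf{A}}_{j,i} = p_{\rightarrow\alpha}$, and hence Equation (\ref{eq:MrSE_kD_RS}) reduces to Equation (\ref{eq:SE_kD_RS}) in this case.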

\begin{proposition}
The probabilistic interpretations of SE and MrSE are identical. 
\end{proposition}
\begin{proof}
Both $p'_{\rightarrow\alpha}$ in Equation (\ref{eq:MrSE_kD_RS}) and $p_{\rightarrow\alpha}$ in Equation (\ref{eq:SE_kD_RS}) essentially mean the probability of entering community $T_\alpha$. We have
\begin{equation}
\begin{split}
p'_{\rightarrow\alpha} & = {\sum}_{i \in V\backslash T_{\alpha}}{\sum}_{j \in T_{\alpha}}P(i,j) \\
& = {\sum}_{i \in V\backslash T_{\alpha}}P(i){\sum}_{j \in T_{\alpha}}P(j|i)\\
& = {\sum}_{i \in V\backslash T_{\alpha}}P(i){\sum}_{j \in T_{\alpha}}{\sum}_{r \in R}P(j|i,r)P(r) \\
& = {\sum}_{i \in V\backslash T_{\alpha}}\textbf{x}'_i{\sum}_{j \in T_{\alpha}}{\sum}_{r \in R}{\mathcal{V}}_{j,i,r}\textbf{y}_r,
\end{split}
\end{equation}
so it aligns with the probabilistic interpretation of $p_{\rightarrow\alpha}$ as shown in Equation (\ref{eq:derive_SE_RSSE_1}).

Additionally, $p'_\alpha$ and $p'_{\alpha^-}$ in Equation (\ref{eq:MrSE_kD_RS}) stand for the probabilities of the surfer being in communities $T_\alpha$ and $T_{\alpha^-}$, respectively. Since $T_\alpha \subset T_{\alpha^-}$, the surfer has to be already in community $T_{\alpha^-}$ before they can enter $T_\alpha$. Therefore, $-\log\frac{p'_\alpha}{p'_{\alpha^-}} = \log{p'_{\alpha^-}} - \log{p'_\alpha}$ is the amount of \textit{new} information, measured in bits, contained in entering $T_\alpha$. Similarly, $-\log\frac{p_\alpha}{p_{\alpha^-}}$ in Equation (\ref{eq:SE_kD_RS}) also stands for the new information contained in entering $T_\alpha$.

Consequently, SE measures the amount of information contained in one step of random walk on a single-relational $G$, while MrSE is the multi-relational counterpart of SE. MrSE and SE share the same probabilistic interpretation.
\end{proof}

In the case that $\textbf{A}'$ is reducible, we need to adjust $G'$ to ensure that $\textbf{x}'$ and $\textbf{y}$ converge. Specifically, we make stochasticity adjustments to $\mathcal{V}$ and $\mathcal{R}$ such that $\forall (i,r), \sum_{j=1}^{|V|}\mathcal{V}_{j,i,r} = 1$ and $\forall (j,i), \sum_{r=1}^{|R|}\mathcal{R}_{j,i,r} = 1$. Additionally, for faster convergence, we make primitivity adjustment by replacing $\mathcal{V}$ with $c\mathcal{V} + (1-c)\textbf{E}'$, where $\textbf{E}'$ is the teleportation tensor. We set $c$ to $0.85$ and $\textbf{E}' = \frac{1}{|V|}\textbf{1}$, where $\textbf{1}$ is a $|V|\times|V|\times|R|$ all-ones tensor. These choices follow the same intuition as Section \ref{sec:RS_to_SE}. Specifically, the $\textbf{E}'$ value specifies that for any relation $r \in R$, the surfer has an equal chance of teleporting to any of the nodes.

\subsection{Decoding Multi-relational Graph Structure via MrSE Minimization}
\label{sec:decode_MrSE}
Uncovering the essential structures within the raw and noisy graphs is crucial. 2D SE minimization \citep{li2016structural} provides an effective unsupervised tool for decoding communities from single-relational graphs and has found applications in various tasks \citep{wu2022structural, wu2023sega, cao2023hierarchical}. In this section, we propose to reveal the essential structures within multi-relational graphs by minimizing the proposed MrSE metric.\footnote{We note that some tasks instead require concealing the essential structures within graphs, i.e., maximizing MrSE. One such example is community structure deception \citep{liu2019rem}. We defer the investigation of MrSE maximization algorithms to the future as they typically require task-specific design.}

First, following \citet{li2016structural}, a MERGE operator is defined as follows.
\begin{definition}
Given an encoding tree $\mathcal{T}$ and two of its non-root nodes, $\alpha_{o_1}$ and $\alpha_{o_2}$, MERGE$(\alpha_{o_1}, \alpha_{o_2})$ removes $\alpha_{o_1}$ and $\alpha_{o_2}$ from $\mathcal{T}$ and adds a new node $\alpha_{n}$ to $\mathcal{T}$. $\alpha_{n}$ satisfies: 1) the children of $\alpha_{n}$ in $\mathcal{T}$ are the union of the children of $\alpha_{o_1}$ and $\alpha_{o_2}$; 2) ${\alpha_n}^- = \lambda$.
\end{definition}

The merge operation changes $\mathcal{T}$ and therefore changes the associated MrSE value.
Based on Definition \ref{definition:MrSE_kD_RS}, the change is:
\begin{equation}
\label{eq:merge_delta_MrSE}
\begin{split}
& \Delta\text{MrSE}_{\alpha_{o_1}, \alpha_{o_2}} = \text{MrSE}_{new} -\text{MrSE}_{old} \\
& = - p'_{\rightarrow\alpha_n}\log{p'_{\alpha_n}} - p'_{\alpha_{o_1}}\log\frac{p'_{\alpha_{o_1}}}{p'_{\alpha_n}} - p'_{\alpha_{o_2}}\log\frac{p'_{\alpha_{o_2}}}{p'_{\alpha_n}}\\
& + p'_{\rightarrow\alpha_{o_1}}\log{p'_{\alpha_{o_1}}} + p'_{\rightarrow\alpha_{o_2}}\log{p'_{\alpha_{o_2}}}.
\end{split}
\end{equation}

The derivation of Equation (\ref{eq:merge_delta_MrSE}) can be found in Appendix \ref{app:derivation}.
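To see what drives the sign of Equation (\ref{eq:merge_delta_MrSE}), the following sketch assumes that the probability of being in the merged community is additive, i.e., $p'_{\alpha_n} = p'_{\alpha_{o_1}} + p'_{\alpha_{o_2}}$ (an assumption made here for illustration, natural when $T_{\alpha_{o_1}}$ and $T_{\alpha_{o_2}}$ are disjoint). Writing $q = p'_{\alpha_{o_1}}/p'_{\alpha_n}$ (a shorthand introduced only for this remark), the second and third terms combine into
\begin{equation*}
\begin{split}
& - p'_{\alpha_{o_1}}\log\frac{p'_{\alpha_{o_1}}}{p'_{\alpha_n}} - p'_{\alpha_{o_2}}\log\frac{p'_{\alpha_{o_2}}}{p'_{\alpha_n}} \\
& = p'_{\alpha_n}\big(-q\log q - (1-q)\log(1-q)\big) \geq 0 .
\end{split}
\end{equation*}
Under this assumption, these two terms never decrease MrSE, so a merge can yield $\Delta\text{MrSE} < 0$ only when the remaining terms, which involve the entering probabilities $p'_{\rightarrow\alpha_n}$, $p'_{\rightarrow\alpha_{o_1}}$, and $p'_{\rightarrow\alpha_{o_2}}$, are sufficiently negative.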

We propose a 2D MrSE minimization algorithm, as shown in Algorithm \ref{algorithm:2D_MrSE}.
Initially, the encoding tree $\mathcal{T}$ assumes no communities, and each node $v \in V$ is assigned to its own cluster (line 2). At this point, the 2D MrSE associated with $\mathcal{T}$ equals the 1D MrSE and represents the intrinsic structural information within $G'$. Subsequently, the 2D MrSE is minimized by greedily and repeatedly merging the two tree nodes in $\mathcal{T}$ whose merge yields the most negative $\Delta \text{MrSE}$, i.e., the largest $|\Delta \text{MrSE}|$ among MrSE-reducing merges, until no further merge leads to a $\Delta \text{MrSE} < 0$ (lines 3--17).
The algorithm outputs an optimized $\mathcal{T}$, associated with the minimum 2D MrSE. At this time, $\mathcal{T}$ encodes reliable structures within $G'$ while eliminating the noisy ones. Specifically, $\mathcal{P}$, the set of the tree nodes of height one, forms a partition of $V$ that reveals the communities in $G'$. Figures \ref{figure:SE_vs_MrSE}(a) and \ref{figure:SE_vs_MrSE}(b) visualize the 2D MrSE minimization process. We also provide a more detailed visualization in Appendix \ref{app:example_MrSE}. In addition, we note that higher-dimensional (such as 3D) MrSE minimization can be realized by repeatedly applying our 2D MrSE minimization algorithm and consolidating the identified communities into nodes. 
%Algorithm \ref{algorithm:2D_MrSE} presents our 2D MrSE minimization algorithm. 
%The multi-relational random surfing in line 1 cost $O(a|E'|)$ \cite{ng2011multirank}, where $a$ is the number of iterations it takes for the MultiRank algorithm to converge.
%Initially, each node is in its own cluster (line 2). The initial encoding tree $\mathcal{T}$ (shown in Figure \ref{}) assumes no communities, and its associated MrSE is equal to the 1D MrSE, i.e., the intrinsic information, of $G'$. 
%MrSE minimization can then be achieved by greedily and repeatedly merging the two nodes in $\mathcal{T}$ that would result in the largest $|\Delta \text{MrSE}|$ until no further merge can lead to a $\Delta \text{MrSE} < 0$ (lines 3 - 14). 
%The algorithm outputs an optimized $\mathcal{T}$ associated with the minimum MrSE. At this time, $\mathcal{T}$ encodes reliable structures while eliminating the noisy ones. Especially, $\mathcal{P}$, a set of the tree nodes of height one, reveals the communities in $G'$. Moreover, we note that higher-dimensional, such as 3D, MrSE minimization can be realized by repeatedly applying our 2D MrSE minimization algorithm and consolidating the identified communities into nodes. 
%(shown in Figure \ref{})

\noindent \textbf{Time Complexity.} The multi-relational random surfing in line 1 costs $O(|E'|)$ \citep{ng2011multirank}. The construction of the initial $\mathcal{T}$ in line 2 takes $O(|V|)$. The while loop in lines 3--17 takes $O(|V||E'|)$. The overall time complexity of Algorithm \ref{algorithm:2D_MrSE} is thus $O(|V||E'|)$. 
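
One way to read the bound on the while loop is sketched below, assuming that the candidate pairs in each iteration are enumerated by scanning edges and that each evaluation of Equation (\ref{eq:merge_delta_MrSE}) takes constant time. Each successful MERGE reduces $|\mathcal{P}|$ by one, so at most $|V|-1$ merges occur, and each iteration only considers community pairs joined by at least one edge, of which there are at most $|E'|$:
\begin{equation*}
\underbrace{(|V|-1)}_{\text{merges}} \times \underbrace{O(|E'|)}_{\text{pairs per iteration}} = O(|V||E'|).
\end{equation*}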

\noindent \textbf{Hierarchical 2D MrSE Minimization.}
Additionally, we note that hierarchical graph partitioning can be integrated to expedite the algorithm. Inspired by the hierarchical 2D SE minimization \citep{cao2023hierarchical}, we introduce a hierarchical 2D MrSE minimization algorithm (Appendix \ref{app:hier_MrSE_minimization}) that runs in $O(n^3)$ time. The hyperparameter $n$ is the size of the subgraph under consideration at each iteration and can be set to $o(|V|)$.
%$n$ is a hyperparameter that can be set to $ \ll |V|$.
% Parameter 'n' denotes the size of the subgraph under consideration at each iteration.
%At the same time, specific tasks, such as community structure deception \cite{liu2019rem}, depend on maximizing the entropy of graphs to conceal their structures. We defer the investigation of MrSE maximization algorithms to the future as they typically require task-specific design.

\begin{algorithm}[h]
\caption{2D MrSE minimization}\label{algorithm:2D_MrSE}
\KwIn{Multi-relational graph $G' = (V, E', R)$}
\KwOut{An optimal encoding tree $\mathcal{T}$ of height 2}
Acquire the node and relation stationary distributions $\textbf{x}'$ and $\textbf{y}$ from multi-relational random surfing on $G'$

Initialize $\mathcal{T}$, s.t. for each node $v \in V$, add two nodes $\alpha$ and ${\alpha}^-$ to $\mathcal{T}$.  $\alpha$ is a leaf node of $\mathcal{T}$ and $T_\alpha = \{v\}$. ${\alpha}^-$ is the parent of $\alpha$ and $h({\alpha}^-) = 1$

\While{True}{
    $\mathcal{P} \gets \{\alpha \mid \alpha \in \mathcal{T}, h(\alpha) = 1\}$
    
    $\Delta \text{MrSE} \gets \infty$
    
    \For{$\alpha_i \in \mathcal{P}$}{
    
        \For{$\alpha_j \in \mathcal{P}, j > i$}{
        
            \If{there are edges between $T_{\alpha_i}$ and $T_{\alpha_j}$}{
                 $\Delta \text{MrSE}_{ij} \gets$ Eq. (\ref{eq:merge_delta_MrSE}), w/o actually merging $\alpha_i$ and $\alpha_j$\\
                \If{$\Delta \text{MrSE}_{ij} < \Delta \text{MrSE}$}{
                    $\Delta \text{MrSE} \gets \Delta \text{MrSE}_{ij}$\\
                    $\alpha_{o_1} \gets \alpha_i$\\
                    $\alpha_{o_2} \gets \alpha_j$
                }
            }
        }
    }
    \If{$\Delta \text{MrSE} < 0$}{
        MERGE$(\alpha_{o_1}, \alpha_{o_2})$ 
    }\Else{
        Break
    }
}
\Return{$\mathcal{T}$}
\end{algorithm}


\section{Experiments}
\label{sec:experiments}
We show that the proposed random surfing-based SE (denoted as RSSE for simplicity) approximates the original SE. Moreover, we show that as compared to SE, MrSE provides a better metric for multi-relational graph structures.
We experiment on synthetic graphs (Section \ref{sec:synthetic}) as well as tasks that involve real-world multi-relational graphs, namely node clustering (Section \ref{sec:node_clustering}) and social event detection (Section \ref{sec:social_event_detection}). 
Our code is publicly available \footnote{\href{https://github.com/YuweiCao-UIC/MrSE.git}{https://github.com/YuweiCao-UIC/MrSE.git}}.

\subsection{Simulation Experiments}
\label{sec:synthetic}
We conduct a study on single- and multi-relational synthetic graphs generated using the Barab\'asi-Albert (BA) model \citep{albert2002statistical}. The BA model incorporates two important general concepts that exist widely in real-world networks: growth, i.e., the network increases over time, and preferential attachment, i.e., the more connected a node is, the more likely it is to receive new links. We proceed to describe our graph generation process.

\textbf{Synthetic data.} To generate single-relational graphs, we adopt the BA graph generator from PyG \footnote{\href{https://pyg.org/}{https://pyg.org/}}. To regulate graph sparsity, edges are randomly dropped out to align with the desired sparsity level. To generate multi-relational graphs, we create multiple BA graphs of identical sizes and then concatenate them along the relation axis. Note that the BA graphs associated with each relation are generated independently, assuming no correlations between the relations. 
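
In tensor form, this construction can be sketched as follows, where $\textbf{A}^{(r)}$ denotes the adjacency matrix of the BA graph generated for relation $r$ (a symbol introduced here for illustration only):
\begin{equation*}
\textbf{A}'_{j,i,r} = \textbf{A}^{(r)}_{j,i}, \quad j,i \in \{1,\dots,|V|\},\ r \in \{1,\dots,|R|\},
\end{equation*}
so that $\textbf{A}' \in \mathbb{R}_{+}^{|V|\times|V|\times|R|}$ and each slice along the relation axis is an independently sampled BA graph.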

\textbf{Experiment setup.} The calculations of 1D SE, RSSE, and MrSE follow Definition \ref{definition:SE_kD}, Equation (\ref{eq:SE_kD_RS}), and Definition \ref{definition:MrSE_kD_RS}, respectively. The calculations of the minimum 2D SE and RSSE follow the 2D SE minimization algorithm of \citet{li2016structural}, while the calculation of the minimum MrSE follows the proposed 2D MrSE minimization algorithm (Algorithm \ref{algorithm:2D_MrSE}). Note that SE, RSSE, and the 2D SE minimization algorithm are designed for single-relational graphs. To apply them to a multi-relational graph $G'$, we preprocess $G'$ by ignoring the heterogeneity in its relations and mapping $G'$ into a single-relational graph $G$, as visualized by Figures \ref{figure:SE_vs_MrSE}(a) and \ref{figure:SE_vs_MrSE}(c).

\textbf{Compare random surfing-based SE to original SE.} Figure \ref{fig:compare_SE} presents the 1D and the \textit{minimum} 2D (denoted as 2D in the legend for simplicity) SE and RSSE of single-relational graphs with varying sizes and sparsities. 
In Figure \ref{fig:compare_SE} (a), the 1D SE and RSSE values increase with the graph size. This means larger graphs contain more structural information. In addition, the minimum 2D SE value also increases with the graph size, indicating that larger graphs contain more noise (this noise refers to the structural information that the optimal encoding tree, derived from the 2D SE minimization process, struggles to interpret). Meanwhile, the 1D RSSE values closely match the 1D SE values, and the minimum 2D RSSE values closely align with the minimum 2D SE values. This alignment suggests that our proposed random surfing-based method is a reliable approximation of the original SE.
%Similarly, in Figure \ref{fig:compare_SE} (b), denser graphs contain more structural information, as reflected in the rising 1D SE and RSSE values. The rising minimum 2D SE and RSSE values indicate an increase in noise within denser graphs. 
Likewise, Figure \ref{fig:compare_SE} (b) suggests that denser graphs contain more structural information and are noisier.
In addition, our proposed random surfing-based method effectively approximates the original SE, except for very sparse (sparsity $>$ 98\%) graphs. When the graph is sparse, both the 1D and the minimum 2D RSSE are higher than those of the original SE. This is caused by the imagined edges and the teleportation matrix introduced during the stochasticity and primitivity adjustments (detailed in Section \ref{sec:MRS_to_MrSE}).

%\textbf{Measure multi-relational graph structural information with MrSE (RQ2).} We compare our proposed MrSE metric to SE on multi-relational graphs. Figures \ref{} and \ref{} show the 1D MrSE and 1D SE (measuring SE requires mapping the multi-relational graphs into single-relational ones) of multi-relational graphs with varying sizes and sparsities, respectively. We can tell that the 1D MrSE is always higher than the 1D SE, suggesting that our proposed MrSE metric encodes more structural information within multi-relational graphs. Specifically, SE only captures the node-wise structures, while MrSE further captures the relation-wise structures. I.e., MrSE provides a better metric for multi-relational graph structural information as compared to SE. 
%misses the structural information contained in the relation dimension

\textbf{Decode multi-relational graph structural information.} We compare the effectiveness of MrSE, RSSE, and SE in decoding the structural information of multi-relational graphs. For each multi-relational graph $G'$, we measure $\Delta \text{MrSE} = (\text{1D MrSE} - \text{minimum 2D MrSE})/\text{1D MrSE}$, which represents the fraction of the structural information within $G'$ that is successfully decoded by minimizing 2D MrSE. The larger $\Delta \text{MrSE}$ is, the more effective MrSE is at deciphering the structure of $G'$. We measure $\Delta \text{SE}$ and $\Delta \text{RSSE}$ in a similar manner, except that the heterogeneity in the relations of $G'$ is ignored, and 2D SE minimization \citep{li2016structural} is applied instead of our proposed 2D MrSE minimization algorithm. Figures \ref{fig:delta_MrSE}(a), \ref{fig:delta_MrSE}(b), and \ref{fig:delta_MrSE}(c) present the $\Delta$SE, $\Delta$RSSE, and $\Delta$MrSE of multi-relational graphs with varying sizes, total numbers of relations, and sparsities, respectively. As the graph size and the total number of relations increase, $\Delta \text{SE}$, $\Delta \text{RSSE}$, and $\Delta \text{MrSE}$ decline, while they increase with sparsity. This suggests that graphs that are larger, denser, and contain more relations are more difficult to decipher. Moreover, $\Delta \text{MrSE}$ consistently surpasses $\Delta \text{SE}$ and $\Delta \text{RSSE}$, regardless of graph size, total number of relations, and sparsity. This suggests that our proposed MrSE, compared to SE and RSSE, offers a more effective tool for measuring and decoding the structural information in multi-relational graphs.
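
As an illustration of this measure with made-up numbers (not experimental results): if a graph had a 1D MrSE of $10$ bits and a minimum 2D MrSE of $7$ bits, then
\begin{equation*}
\Delta\text{MrSE} = \frac{10 - 7}{10} = 0.3,
\end{equation*}
i.e., $30\%$ of the structural information would be decoded by the optimized encoding tree. A value of $0$ would mean that no merge reduced the 2D MrSE below the 1D MrSE.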

%\textbf{Decode multi-relational graph structural information with MrSE (RQ3).} We compare our proposed 2D MrSE minimization with 2D SE minimization \cite{li2016structural} on decoding the structural information of multi-relational graphs. Given a multi-relational graph $G'$, we first calculate its 1D MrSE, then apply 2D MrSE minimization to acquire the minimized 2D MrSE of $G'$. $\Delta \text{MrSE} = (\text{1D MrSE} - \text{2D MrSE})/\text{1D MrSE}$ represents the fraction of the structural information within $G'$ that our 2D MrSE minimization algorithm successfully decoded. The larger $\Delta \text{MrSE}$ is, the more effective our algorithm is at deciphering the structure of $G'$. Similarly, $\Delta \text{SE} = (\text{1D SE} - \text{2D SE})/\text{1D SE}$ measures the effectiveness of the 2D SE minimization algorithm in decoding the structure of $G$, a single-relational graph mapped from $G'$. Figures \ref{}, \ref{}, and \ref{} show the $\Delta$ MrSE and $\Delta$ SE of multi-relational graphs with varying number of relations, sizes, and sparsities, respectively. It is evident that $\Delta \text{MrSE}$ consistently surpasses $\Delta \text{SE}$, suggesting that our proposed 2D MrSE minimization, when compared to 2D SE minimization, offers a more efficient approach for decoding the structural information in multi-relational graphs.

%To evaluate our proposed MrSE and 2D MrSE minimization algorithm (RQ2 and RQ3), multi-relational
 



\begin{figure}
     \centering
     \begin{subfigure}[b]{5.5cm}
         \centering
         \includegraphics[width=\columnwidth]{./image/SE_v4_3.png}
         \caption{}
         %\caption{SE with varying graph sizes}
     \end{subfigure}
     %\bigskip
     %\hfill
     \begin{subfigure}[b]{5.2cm}
         \centering
         \includegraphics[width=\columnwidth]{./image/SE_sparsity_v4_4.png}
         \caption{}
         %\caption{SE with varying sparsities}
     \end{subfigure}
        \caption{The 1D and minimum 2D SE and RSSE of single-relational graphs with varying sizes (a) and sparsities (b).}
        \label{fig:compare_SE} 
\end{figure}



\begin{figure*}
     \centering
     \begin{subfigure}[b]{5.6cm}
         \centering
         \includegraphics[width=\columnwidth]{./image/multi_delta_SE_nodes_2.pdf}
         \caption{}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{5.6cm}
         \centering
         \includegraphics[width=\columnwidth]{./image/multi_delta_SE_relations_2.pdf}
         \caption{}
         %\caption{NDCG@5 w.r.t. $w_3$}
     \end{subfigure}
     \hfill
     \begin{subfigure}[b]{5.6cm}
         \centering
         \includegraphics[width=\columnwidth]{./image/multi_delta_SE_sparsity_2.pdf}
         \caption{}
         %\caption{Loss values of lines 23}
     \end{subfigure}
    
        \caption{The $\Delta$SE, $\Delta$RSSE, and $\Delta$MrSE of multi-relational graphs with varying sizes (a), total numbers of relations (b), and sparsities (c).}
        \label{fig:delta_MrSE}
\vspace{-1em}
\end{figure*}

\subsection{Multi-relational Node Clustering}
\label{sec:node_clustering}
Unsupervised node clustering is an essential task in graph analysis. In this section, we evaluate our proposed MrSE on multi-relational graph node clustering.
%In this section, we conduct node clustering on multi-relational graphs to validate the effectiveness of our proposed 2D MrSE minimization algorithm.

\noindent \textbf{Datasets.} Following previous multi-relational graph embedding studies \citep{park2020unsupervised}, we evaluate on IMDB \citep{fu2020magnn}, DBLP \citep{fu2020magnn}, and ACM \citep{lv2021we}. IMDB is a movie dataset. The movies are divided into three classes (action, comedy, drama) according to their genre. Movie features correspond to the bag-of-words representations of plots. DBLP is a publication dataset containing authors that are labeled according to their research areas (database, data mining, machine learning, information retrieval). Author features are the bag-of-words representations of keywords. ACM is a publication dataset containing papers divided into three classes (database, wireless communication, and data mining). Paper features correspond to the bag-of-words representations of keywords. Following \citet{park2020unsupervised}, the multi-relational edges are inferred via intermediate nodes (e.g., for IMDB, the `M-A-M' edges are inferred via actor nodes, and the `M-D-M' edges are inferred via director nodes). Table \ref{table:node_clustering_stats} shows the data statistics. 

%https://arxiv.org/pdf/1911.06750.pdf
\begin{table}[t]
\caption{Statistics of the node clustering datasets. For IMDB, M, A, and D denote movie, actor, and director; for DBLP, A, P, T, and C denote author, paper, term, and conference; for ACM, P, C, A, S, and T denote paper, cite, author, subject, and term.}
\centering
\scriptsize
\begin{adjustbox}{width=\linewidth}
    \begin{tabular}{c|ccccc}
    \toprule
    %\hline
    Dataset & $|V|$ & $R$ & $|E'|$ & Sparsity (\%) & $|Y|$ \\
    \midrule
    %\hline
    \multirow{4}{*}{ACM} & \multirow{4}{*}{3,025} & P-C-P & 5,335 & 99.88 & \multirow{4}{*}{3} \\
    &  & P-A-P & 13,374 & 99.71 & \\
    &  & P-S-P & 1,107,032 & 75.80 & \\
    &  & P-T-P & 4,573,785 & $<$0.01 & \\
    \midrule
    %\hline
    \multirow{3}{*}{DBLP} & \multirow{3}{*}{4,057} & A-P-A & 3,528 & 99.96 & \multirow{3}{*}{4} \\
    &  & A-P-T-P-A & 3,519,757 & 57.22 &  \\
    &  & A-P-C-P-A & 2,498,219 & 69.64 &  \\
    \midrule
    %\hline
    \multirow{2}{*}{IMDB} & \multirow{2}{*}{4,278} & M-A-M & 40,540 & 99.56 & \multirow{2}{*}{3} \\
    &  & M-D-M & 6,584 & 99.93 & \\
    \bottomrule
    %\hline
    \end{tabular}
\end{adjustbox}
  \label{table:node_clustering_stats}
\vspace{-1.5em}
\end{table}
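
\noindent The sparsity values in the dataset statistics table are consistent with counting each relation's edges as undirected node pairs without self-loops; this is a reading of the reported numbers rather than a definition provided with the datasets. Under that assumption, with $E'_r$ denoting the edges of relation $r$ (notation introduced here for illustration),
\begin{equation*}
\text{Sparsity}_r = 1 - \frac{2|E'_r|}{|V|(|V|-1)},
\end{equation*}
and, e.g., for the P-S-P relation of ACM, $1 - \frac{2 \times 1{,}107{,}032}{3{,}025 \times 3{,}024} \approx 0.7580$, matching the reported $75.80\%$.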

\noindent \textbf{Baselines.} We compare the proposed \textbf{MrSE} to \textbf{SE} \citep{li2016structural} and \textbf{RSSE}. We further compare to a deep learning-based spectral clustering method, i.e., \textbf{SpectralNet} \citep{shaham2018spectralnet}. We also consider random walk-based methods, including \textbf{node2vec} \citep{grover2016node2vec}, which learns node embeddings with random walks and skip-gram, and \textbf{metapath2vec} \citep{dong2017metapath2vec}, which performs metapath-based random walk.
In addition, we compare to GNN-based methods, including
\textbf{DGI} \citep{velivckovic2018deep}, which maximizes global-local mutual information; \textbf{DMGI} \citep{park2020unsupervised}, which is the multi-relational counterpart of DGI; and \textbf{DMoN} \citep{tsitsulin2023graph}, which maximizes graph modularity \citep{newman2006finding}. We additionally perform \textbf{k-means} clustering using node features to gauge their informativeness.
All methods are unsupervised. Among them, metapath2vec, DMGI, and MrSE explore the heterogeneity of relations. Note that the GNN-based methods, i.e., DGI, DMGI, and DMoN, leverage node features in addition to graph structure, which gives them an extra edge over methods that rely solely on graph structure.
% kmeans?

\noindent \textbf{Metrics.} Following \citet{park2020unsupervised} and \citet{tsitsulin2023graph}, we report normalized mutual information (NMI). We further report the adjusted Rand index (ARI) and unsupervised clustering accuracy (ACC, in Appendix \ref{app:node_clustering_acc}), which are commonly adopted clustering metrics.
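
For reference, one common formulation of the two main metrics is sketched below. The normalization of NMI varies across libraries, so the arithmetic-mean variant shown here is one common choice given for illustration only. Here $Y$ denotes the ground-truth labels, $C$ the predicted clusters, $I(\cdot;\cdot)$ mutual information, $H(\cdot)$ entropy, and $\text{RI}$ the Rand index:
\begin{gather*}
\text{NMI}(Y, C) = \frac{2\, I(Y; C)}{H(Y) + H(C)}, \\
\text{ARI} = \frac{\text{RI} - \mathbb{E}[\text{RI}]}{\max(\text{RI}) - \mathbb{E}[\text{RI}]}.
\end{gather*}
NMI lies in $[0, 1]$, whereas ARI is corrected for chance: it is close to $0$ for random partitions and can be negative.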

\noindent \textbf{Experiment setup.} To evaluate single-relational methods including SpectralNet, DMoN, DGI, SE, and RSSE, we preprocess the multi-relational datasets by mapping them into single-relational ones. Additionally, following \citet{park2020unsupervised}, we explore leveraging heterogeneous relations with the single-relational embedding methods, i.e., SpectralNet, DMoN, and DGI. Specifically, we obtain the final node embedding by averaging the node embeddings obtained from the single-relational graphs that correspond to each relation. For node2vec and metapath2vec, we extend the graphs to contain the intermediate nodes (e.g., the actor and director nodes for IMDB). For k-means, we set the number of clusters to the ground truth, i.e., 3, 4, and 3 for ACM, DBLP, and IMDB, respectively. Similarly, for the representation-learning models including node2vec, metapath2vec, DGI, and DMGI, we perform k-means clustering after learning representations, setting the number of clusters to the ground truth to obtain final community predictions. We implement SE, RSSE, and MrSE using Python. For SpectralNet, we use the source code provided by the authors\footnote{\href{https://github.com/shaham-lab/SpectralNet}{https://github.com/shaham-lab/SpectralNet}}. For the remaining models, we leverage the implementations from the PyG package. We repeat all experiments 5 times and average results across runs. The method-specific hyperparameters are decided according to the original papers and are provided in Appendix~\ref{app:node_clustering_setting}.
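
The per-relation averaging described above can be written as follows, where $\textbf{z}_v^{(r)}$ is the embedding of node $v$ learned on the single-relational graph of relation $r$ and $\textbf{z}_v$ is the final embedding (both symbols are introduced here for illustration):
\begin{equation*}
\textbf{z}_v = \frac{1}{|R|}\sum_{r \in R}\textbf{z}_v^{(r)}.
\end{equation*}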
%All models are tuned to achieve their optimal performance, and detailed hyperparameters are provided in Appendix \ref{}.
%%================

%%================
\noindent \textbf{Node clustering results.}
Table \ref{table:node_clustering_data} presents the node clustering results. MrSE outperforms all baselines in both NMI and ARI on the ACM and DBLP datasets. On the IMDB dataset, MrSE achieves the highest NMI but falls short in terms of ARI. These results highlight the strong capability of MrSE in identifying communities within multi-relational graphs. This is particularly noteworthy given that certain methods, i.e., the GNN-based ones, utilize graph structure and node features, whereas MrSE relies exclusively on graph structure. 

For the ACM dataset, which contains extremely dense relations (e.g., `P-T-P'), single-relational methods show near-zero results or fail to run. Addressing the heterogeneity of relations, either by averaging per-relation embeddings, introducing metapaths, or applying distinct weights to relations, results in a substantial performance boost. For example, DGI with $G', X$ input outperforms DGI with $G, X$ input, metapath2vec outperforms node2vec, and MrSE outperforms SE and RSSE.
For the DBLP and IMDB datasets, the strategy of averaging per-relation embeddings is less consistently effective in handling heterogeneous relations: DMoN with $G', X$ input attains a lower NMI than DMoN with $G, X$ input on both datasets, and DGI with $G', X$ input falls below DGI with $G, X$ input on IMDB. Somewhat surprisingly, DMGI performs worse than DGI with $G', X$ input on ACM and DBLP, suggesting that weighting per-relation embeddings with attention is not necessarily more effective than simply averaging them. 

In addition, we observe that SE and RSSE resemble each other across datasets and metrics, showing that RSSE  effectively approximates SE. Moreover, MrSE consistently outperforms SE and RSSE by large margins, indicating that MrSE offers a better metric for interpreting the structural information within multi-relational graphs.

% 
% DMGI

\begin{table}[t]
\caption{Node clustering results (\%). Bold and underlined values mark the best and second-best results, respectively. `/' indicates that SpectralNet fails to run on ACM.}
\centering
\scriptsize
\begin{adjustbox}{width=\linewidth}
    \begin{tabular}{c|c|cc|cc|cc}
    \toprule
    %\hline
    \multirow{2}{*}{Method} & \multirow{2}{*}{Input} & \multicolumn{2}{c}{ACM} &\multicolumn{2}{c}{DBLP} & \multicolumn{2}{c}{IMDB}\\
     & & NMI & ARI & NMI & ARI & NMI & ARI \\
    %\hline
    \midrule
    k-means & $X$ & 25.80 & 16.32 & 20.65 & 7.37  & 3.59 & 0.00 \\
    %\hline
    \midrule
    DMoN & $G$, $X$ & 0.00 & 0.00 & 38.23 & 6.50 & 4.53 & 0.61\\
    DGI & $G$, $X$ & 0.32 & 0.01 & 33.38 & 12.85  & 7.70 & \textbf{9.07}\\
    %\hline
    \midrule
    SpectralNet & $G$ & / & /  & 39.09 & 8.44 & 3.83 & 0.69 \\
    node2vec & $G$ & 0.09 & 0.03  & 27.04 & 15.90 & 2.75 & 3.01 \\
    SE & $G$ & 3.16 & 3.16 & 39.10 & \underline{41.84}  & \underline{8.82} & 0.17 \\
    RSSE (ours) & $G$ & 3.56 & 3.66  & 39.12 & 41.06 & \underline{8.82} & 0.18 \\
    %\hline
    \midrule
    DMoN & $G'$, $X$ & 21.10 & 8.82 & 15.99 & 6.84 & 1.27 & 0.47  \\
    DGI & $G'$, $X$ & 47.03 & \underline{43.66}  & 34.29 & 29.79 & 0.44 & 0.00\\
    DMGI & $G'$, $X$ & 25.53 & 20.50 & 1.43 & 1.21 & 0.63 & 0.45 \\
    %\hline
    \midrule
    SpectralNet & $G'$ & / & /  & 36.58 & 30.88 & 1.16 & 0.15\\
    metapath2vec & $G'$ & \underline{47.51} & 42.90 & \underline{45.56} & 36.76  & 5.34 & \underline{5.10} \\
    MrSE (ours) & $G'$ & \textbf{48.36} & \textbf{55.80} & \textbf{49.26} & \textbf{55.78} & \textbf{13.68} & 0.19 \\
    %\hline
    \bottomrule
    \end{tabular}
\end{adjustbox}
  \label{table:node_clustering_data}
\end{table}



\begin{table}[t]
\caption{Social event detection results (\%), averaged over all blocks.}
\centering
\begin{adjustbox}{width=0.85\linewidth}
    \begin{tabular}{c|c|c|cc}
    \toprule
    Dataset & Metric & HISEvent  & RSSE (ours) & MrSE (ours) \\
    \midrule
    \multirow{2}{*}{Event2012} & NMI & 82.94 &  \underline{83.01} & \textbf{84.17} \\
     & ARI & \underline{63.15}  & 63.10  &  \textbf{64.17}  \\
    \midrule
    \multirow{2}{*}{Event2018} & NMI & \underline{76.08}  &  75.82 &  \textbf{76.64} \\
     & ARI & \underline{60.25}  & 59.39 &  \textbf{60.91} \\
    \bottomrule
    \end{tabular}
\end{adjustbox}
  \label{table:social_event_detection_results}
\vspace{-1em}
\end{table}

\subsection{Social Event Detection}
\label{sec:social_event_detection}

We note that the proposed MrSE, serving as the multi-relational counterpart to SE, can enhance the extensive applications of SE by tackling heterogeneous relations. One such application is social event detection, which is commonly formalized as extracting clusters of co-related messages from streams of social media messages. \citet{cao2023hierarchical} achieve state-of-the-art (SOTA) social event detection performance using 2D SE minimization. However, their method overlooks the heterogeneity of message correlations. In this section, we explore social event detection using the proposed 2D MrSE minimization and observe the performance changes resulting from the introduction of heterogeneous message correlations.


\noindent \textbf{Datasets.} We experiment on two large, public Twitter datasets, i.e., Event2012 \citep{mcminn2013building} and Event2018 \citep{mazoyer2020french}. 
Within Event2012, there are 68,841 English tweets associated with 503 events, spanning a four-week period. Event2018 consists of 64,516 French tweets discussing 257 events and occurring over a 23-day period.
We adopt the data splits of \citet{cao2023hierarchical} to evaluate under the open-set setting, which assumes that events happen over time and splits the datasets into day-wise message blocks. Data statistics are in Appendix \ref{app:social_event_data}.


\noindent \textbf{Baselines.} We compare to \textbf{HISEvent} \citep{cao2023hierarchical}, which is the current SOTA in social event detection. 
It begins by constructing \textit{message graphs}, where nodes represent messages and correlated messages are connected. These correlations may arise from shared senders, mentioned users, hashtags, named entities, or similar natural language semantics, but HISEvent ignores this heterogeneity.
Subsequently, it partitions the message graphs using 2D SE minimization to extract social events, which are represented by clusters of messages. 
Note that we omit the direct comparison with various social event detectors that HISEvent has outperformed, including ones that leverage GNN \citep{cao2021knowledge, ren2022known, peng2022reinforced, ren2023uncertainty}, betweenness centrality \citep{liu2020story}, TF-IDF \citep{bafna2016document}, LDA \citep{blei2003latent}, etc.

\noindent \textbf{Metrics.} Following previous social event detection studies \citep{cao2023hierarchical}, we report NMI and ARI.

\noindent \textbf{Experiment setup.} For HISEvent, we use the source code provided by the authors \footnote{\href{https://github.com/SELGroup/HISEvent}{https://github.com/SELGroup/HISEvent}}.
To evaluate RSSE, we simply replace the SE in HISEvent with RSSE. To evaluate MrSE, we make two changes: first, we explore the heterogeneity of message correlations and construct \textit{multi-relational message graphs} (detailed in Appendix \ref{app:social_event_graph}); second, we replace the 2D SE minimization in HISEvent with the proposed 2D MrSE minimization. For all three methods, we adopt the same hyperparameters as specified in the HISEvent paper. 


\noindent \textbf{Social event detection results.} Table \ref{table:social_event_detection_results} presents the social event detection results. MrSE demonstrates superior performance compared to HISEvent across datasets and metrics. This suggests that by delving into heterogeneous message correlations, the proposed MrSE enhances the efficacy of social event detection in comparison to HISEvent, which relies on the original SE and overlooks the heterogeneity in message correlations. Furthermore, RSSE performs comparably to HISEvent, which utilizes SE. This suggests that RSSE and SE can be used interchangeably.
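
Concretely, the absolute gains of MrSE over HISEvent in Table \ref{table:social_event_detection_results}, in percentage points, are
\begin{align*}
\text{Event2012 (NMI):}\quad & 84.17 - 82.94 = 1.23, \\
\text{Event2012 (ARI):}\quad & 64.17 - 63.15 = 1.02, \\
\text{Event2018 (NMI):}\quad & 76.64 - 76.08 = 0.56, \\
\text{Event2018 (ARI):}\quad & 60.91 - 60.25 = 0.66.
\end{align*}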


\section{Conclusion}
\label{sec:conclusion}
In this study, we propose MrSE, the first metric of multi-relational graph structural information. We begin by reexamining the original definition of SE from the viewpoint of random surfing. Subsequently, the definition of MrSE is derived from random surfing on multi-relational graphs. Additionally, we introduce a 2D MrSE minimization algorithm designed to unveil communities within these complex graphs. Results from experiments on both synthetic and real-world graphs, including movie, publication, and social message networks, demonstrate that the proposed MrSE is a powerful metric for assessing and unraveling the structural information within multi-relational graphs. MrSE exhibits strong performance in two tasks, namely multi-relational node clustering and social event detection.


\section*{Acknowledgments}
We thank Prof. Xinhua Zhang for his helpful discussions and suggestions.
This work is supported by the National Key R\&D Program of China through grant 2022YFB3104700, NSFC through grants 62322202 and 61932002, Beijing Natural Science Foundation through grant 4222030, Guangdong Basic and Applied Basic Research Foundation through grant 2023B1515120020, S\&T Program of Hebei through grants 20310101D and 21340301D, and Shijiazhuang Science and Technology Plan Project through grant 231130459A.
Philip S. Yu is supported in part by NSF under grant III-2106758. 













% References
%\bibliography{uai2024-template}
\bibliography{MrSE}

\newpage

%\onecolumn

%\title{Title in Title Case\\(Supplementary Material)}
%\maketitle

\newpage
\appendix
%\onecolumn
\section{Notations}
\label{app:notations}

Table \ref{table:notation} summarizes the main notations used in this paper.

\begin{table}[t]
    \resizebox{\linewidth}{!}{%
    \begin{tabular}{r|l}  
    \hline
      \textbf{Notation} & \textbf{Description}\\
      \hline
      $G = (V, E)$ & Single-relational graph with node set $V$ \\ & and edge set $E$  \\ 
      $\textbf{A}$ & $\textbf{A} \in \mathbb{R}_{+}^{|V|\times |V|}$, the adjacency matrix of $G$ \\
      $\tilde{\textbf{A}}$ & Transition probability matrix of random surfing \\
      %Column-normalized $\textbf{A}$\\
      $\textbf{B}$ & $\tilde{\textbf{A}}$ after primitivity adjustment \\
      $\textbf{E}$ & Teleportation matrix \\
      $\textbf{x}$ & The stationary distribution of random surfing \\
      $G' = (V, E', R)$ & Multi-relational graph with node set $V$, \\ & heterogeneous edge set $E'$, and relation set $R$  \\ 
      $\textbf{A}'$ & $\textbf{A}' \in \mathbb{R}_{+}^{|V|\times |V|\times|R|}$, the adjacency matrix of $G'$ \\
      $\mathcal{V};\mathcal{R}$ & Node and relation transition probability matrices \\ & of multi-relational random surfing\\
      $\textbf{E}$ & Multi-relational teleportation matrix \\
      $\textbf{x}';\textbf{y}$ & Node and relation stationary distributions of  \\ & multi-relational random surfing \\
      $\mathcal{T}$ & Encoding tree\\
      $\alpha, \lambda, \gamma \in \mathcal{T}$ & Node, root node, leaf node in $\mathcal{T}$\\
      $\alpha^-$ & The parent of $\alpha$\\
      $T_\alpha, T_\lambda, T_\gamma \subseteq V$ & Node sets associated with $\alpha, \lambda, \gamma$\\
      $h(\alpha)$ & Height of $\alpha$ \\
      $h(\mathcal{T})$ & Height of $\mathcal{T}$ \\
      $g_\alpha$ & Summation of the degrees of the cut edges of $T_\alpha$\\
      $\mathrm{vol}(T_\alpha);\mathrm{vol}(T_\lambda)$ & Volume of $T_\alpha$; Volume of $T_\lambda$, i.e., $V$\\
      $\mathcal{P}$ & A partition of $V$\\
      $\mathcal{H}^{\mathcal{T}}(G)$ & The structural entropy (SE) of $G$ relative to $\mathcal{T}$ \\
      $\mathcal{H}^{(k)}(G)$ & The $k$-dimensional SE of $G$\\
      $\mathcal{H}^{\mathcal{T}}(G')$ & The multi-relational structural entropy (MrSE) \\ & of $G'$ relative to $\mathcal{T}$ \\
      $\mathcal{H}^{(k)}(G')$ & The $k$-dimensional MrSE of $G'$\\
      $p_{\rightarrow\alpha}$ & The probability of entering community $T_\alpha$ \\ & during random surfing \\
      $p_{\alpha}$ & The probability of being in community $T_\alpha$  \\ & during random surfing \\
      $p'_{\rightarrow\alpha}$ & The probability of entering community $T_\alpha$ \\ & during multi-relational random surfing \\
      $p'_{\alpha}$ & The probability of being in community $T_\alpha$  \\ & during multi-relational random surfing \\ 
      \hline
    \end{tabular}}
    \caption{Glossary of Notations.}
    \label{table:notation}
\end{table}

\begin{figure*}[ht]
\begin{center}
\centerline{\includegraphics[width=1.7\columnwidth]{./image/SE.pdf}}
\caption{Examples of a single-relational graph, encoding trees, and 2D SE minimization. (a) is a single-relational graph $G$. (b) is the encoding tree of height 1, which encodes the 1st-order structures, i.e., nodes, in $G$. (c) - (e) demonstrate how 2D SE minimization detects the 2nd-order structures, i.e., communities, in $G$. Initially, each node in $G$ is assigned to its own cluster. $\mathcal{P}$ in (c) shows the initial clusters. Following the vanilla greedy 2D SE minimization algorithm \citep{li2016structural}, at each step, the two clusters whose merger reduces SE the most are merged. Eventually, the optimal encoding tree of height 2, shown in (e), is associated with the minimum possible SE value and encodes the communities in $G$. $\mathcal{P}$ in (e) shows the detected communities.}
\label{figure:SE}
\end{center}
\end{figure*}

\begin{figure*}[ht]
\begin{center}
\centerline{\includegraphics[width=2\columnwidth]{./image/MrSE.pdf}}
\caption{Examples of a multi-relational graph, encoding trees, and 2D MrSE minimization. (a) is a multi-relational graph $G'$. (b) is the adjacency tensor of $G'$. (c) is the encoding tree of height 1, which encodes the 1st-order structures, i.e., nodes, in $G'$. (d) - (f) demonstrate how 2D MrSE minimization detects the 2nd-order structures, i.e., communities, in $G'$. Initially, each node in $G'$ is assigned to its own cluster. $\mathcal{P}$ in (d) shows the initial clusters. Following our proposed 2D MrSE minimization algorithm (Algorithm \ref{algorithm:2D_MrSE}), at each step, the two clusters whose merger reduces MrSE the most are merged. Eventually, the optimal encoding tree of height 2, shown in (f), is associated with the minimum possible MrSE value and encodes the communities in $G'$. $\mathcal{P}$ in (f) shows the detected communities.}
\label{figure:MrSE}
\end{center}
\end{figure*}

\section{Examples of 2D SE minimization}
\label{app:example_SE}
We provide examples of a single-relational graph, encoding trees, and the 2D SE minimization process in Figure \ref{figure:SE}. 

Note that an encoding tree is essentially a description of a graph's structure. For a graph $G$, the encoding tree of height 1 is unique: it contains one root node and $|V|$ leaf nodes, each corresponding to a node in $G$. In this way, the encoding tree of height 1 simply describes the fact that $G$ has $|V|$ nodes and makes no assumptions about higher-order structures, such as communities. Figure \ref{figure:SE} (b) shows an example of an encoding tree of height 1. In contrast, an encoding tree of height 2 has an intermediate layer between the root node and the leaf nodes. This intermediate layer describes the 2nd-order structures, i.e., communities, in the graph. Since there are different ways to partition the nodes in $G$, $G$ can have many encoding trees of height 2. In the task of community detection, the goal is to find the optimal encoding tree of height 2, i.e., the one associated with the minimum SE. Figures \ref{figure:SE} (c) - (e) show examples of encoding trees of height 2, among which (e) is the optimal one.
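
For reference, the two kinds of encoding trees can be compared through their SE values. The following restates the standard SE definition of \citet{li2016structural} in the notation of Table \ref{table:notation}; it is a sketch that assumes an undirected $G$ without self-loops, and $d_v$, the (weighted) degree of node $v$, is a symbol introduced here for illustration only.
\begin{equation*}
\mathcal{H}^{\mathcal{T}}(G) = -\sum_{\alpha \in \mathcal{T},\, \alpha \neq \lambda} \frac{g_\alpha}{\mathrm{vol}(T_\lambda)} \log \frac{\mathrm{vol}(T_\alpha)}{\mathrm{vol}(T_{\alpha^-})}.
\end{equation*}
For the encoding tree of height 1, every non-root node is a leaf $\gamma$ with $T_\gamma = \{v\}$, $g_\gamma = \mathrm{vol}(T_\gamma) = d_v$, and $\gamma^- = \lambda$, so the sum reduces to the Shannon entropy of the degree distribution:
\begin{equation*}
\mathcal{H}^{(1)}(G) = -\sum_{v \in V} \frac{d_v}{\mathrm{vol}(T_\lambda)} \log \frac{d_v}{\mathrm{vol}(T_\lambda)}.
\end{equation*}
For an encoding tree of height 2, each community node $\alpha$ with $h(\alpha) = 1$ contributes one term for itself and one term per member node:
\begin{equation*}
\mathcal{H}^{\mathcal{T}}(G) = -\sum_{\alpha:\, h(\alpha) = 1} \left[ \frac{g_\alpha}{\mathrm{vol}(T_\lambda)} \log \frac{\mathrm{vol}(T_\alpha)}{\mathrm{vol}(T_\lambda)} + \sum_{v \in T_\alpha} \frac{d_v}{\mathrm{vol}(T_\lambda)} \log \frac{d_v}{\mathrm{vol}(T_\alpha)} \right].
\end{equation*}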

Figures \ref{figure:SE} (c) - (e) illustrate community detection through 2D SE minimization. Initially, each node in $G$ is assigned to its own cluster. $\mathcal{P}$ in (c) shows the initial clusters. Following the vanilla greedy 2D SE minimization algorithm \citep{li2016structural}, at each step, the two clusters whose merger reduces SE the most are merged. Eventually, the optimal encoding tree of height 2, shown in (e), is associated with the minimum possible SE value and encodes the communities in $G$. $\mathcal{P}$ in (e) shows the detected communities.

\section{Examples of 2D MrSE minimization}
\label{app:example_MrSE}
We provide examples of a multi-relational graph, encoding trees, and the 2D MrSE minimization process in Figure \ref{figure:MrSE}. 

As an example of multi-relational random surfing, consider $G'$ and $\textbf{A}'$ as shown in Figure \ref{figure:MrSE} (a) and (b). At each step of the multi-relational surfing on $G'$, the surfer follows $\textbf{A}'$ to randomly and jointly decide which neighboring node to visit and which relation to use. For example, assume that the surfer is at node $v_1$. Through relation $R_1$, they can visit $v_2$ or $v_5$, as $\textbf{A}_{2,1,R_1}^{'} = 1$ and $\textbf{A}_{5,1,R_1}^{'} = 1$. Similarly, through $R_2$, they can visit $v_2$ or $v_5$, as $\textbf{A}_{2,1,R_2}^{'} = 1$ and $\textbf{A}_{5,1,R_2}^{'} = 1$. Finally, through $R_3$, the surfer can visit $v_3$, as $\textbf{A}_{3,1,R_3}^{'} = 1$. In this manner, the surfer takes an infinitely long random walk on $G'$.
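
To make the joint choice concrete, the entries above can be turned into transition probabilities. The sketch below rests on assumptions that are not restated in this appendix: the five unit entries listed are taken to be the only nonzero entries of $\textbf{A}'$ with source node $v_1$, and $\mathcal{V}$ and $\mathcal{R}$ are taken to be $\textbf{A}'$ normalized over target nodes and over relations, respectively, with index order (target $i$, source $j$, relation $k$). Under these assumptions,
\begin{equation*}
\begin{split}
\mathcal{V}_{i,1,k} & = \frac{\textbf{A}'_{i,1,k}}{\sum_{i'} \textbf{A}'_{i',1,k}}: \quad \mathcal{V}_{2,1,R_1} = \mathcal{V}_{5,1,R_1} = \tfrac{1}{2}, \quad \mathcal{V}_{2,1,R_2} = \mathcal{V}_{5,1,R_2} = \tfrac{1}{2}, \quad \mathcal{V}_{3,1,R_3} = 1, \\
\mathcal{R}_{i,1,k} & = \frac{\textbf{A}'_{i,1,k}}{\sum_{k'} \textbf{A}'_{i,1,k'}}: \quad \mathcal{R}_{2,1,R_1} = \mathcal{R}_{2,1,R_2} = \tfrac{1}{2}, \quad \mathcal{R}_{5,1,R_1} = \mathcal{R}_{5,1,R_2} = \tfrac{1}{2}, \quad \mathcal{R}_{3,1,R_3} = 1.
\end{split}
\end{equation*}
That is, given relation $R_1$ or $R_2$ the surfer moves to $v_2$ or $v_5$ with equal probability, and a move from $v_1$ to $v_3$ can only use $R_3$.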

The 2D MrSE minimization process is similar to the 2D SE minimization, shown in Figure \ref{figure:SE}. The distinction is that the proposed MrSE, instead of SE, is utilized to determine which two clusters to merge at each step.


\section{Hierarchical 2D MrSE Minimization}
\label{app:hier_MrSE_minimization}
Inspired by how \citet{cao2023hierarchical} hierarchically minimize 2D SE, we propose to speed up Algorithm \ref{algorithm:2D_MrSE} with hierarchical graph partitioning. Algorithm \ref{algorithm:hierarchical_2D_MrSE} shows our hierarchical 2D MrSE minimization algorithm. Instead of considering the entire $G'$ at once, Algorithm \ref{algorithm:hierarchical_2D_MrSE} minimizes the MrSE of one subgraph of size $n$ at a time (lines 5-13). After minimizing the MrSE values of all subgraphs, the process continues by treating the clusters formed in the last iteration as the nodes to be merged in the next iteration (lines 3-4). The process terminates once all remaining clusters are considered simultaneously, i.e., within a single subgraph (lines 14-15). If, at some point, no nodes can be merged within any subgraph, we increase $n$ so that each subgraph covers more nodes (lines 16-17), which makes further merges possible. 
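
As a hypothetical trace of Algorithm \ref{algorithm:hierarchical_2D_MrSE} (the sizes $|V| = 10$ and $n = 4$, and the cluster counts below, are chosen purely for illustration), the first iteration starts from ten singleton clusters $\alpha_1, \dots, \alpha_{10}$ and splits them into
\begin{equation*}
\{\mathcal{P}_{g}\} = \big\{ (\alpha_1, \dots, \alpha_4),\ (\alpha_5, \dots, \alpha_8),\ (\alpha_9, \alpha_{10}) \big\},
\end{equation*}
so that Algorithm \ref{algorithm:2D_MrSE} is run on three subgraphs. Since $|\{\mathcal{P}_{g}\}| = 3 \neq 1$, the loop continues. If, say, four clusters remain after this pass, the next iteration forms a single set $\mathcal{P}_{g}$ that contains all of them, minimizes MrSE with these clusters as the starting partition, and then terminates. If instead a pass leaves $\mathcal{P}$ unchanged at ten clusters, $n$ is doubled to 8 and the following iteration considers $(\alpha_1, \dots, \alpha_8)$ and $(\alpha_9, \alpha_{10})$.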

The overall time complexity of Algorithm \ref{algorithm:hierarchical_2D_MrSE} is reduced from $O(|V||E'|)$
to $O(|V_g||E^{'}_{g}|)$, which is bounded by $O(n^3)$, where $|V_g| = n$ is the size of one subgraph and $|E^{'}_{g}| < n^2$ is the number of edges in one subgraph.
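
As a rough illustration of this bound (a back-of-the-envelope estimate under an interpretation made here, not a measured cost), read $O(n^3)$ as the cost of processing one subgraph and consider the first pass, in which every cluster is a single node and hence $|V_g| \le n$. The pass processes $\lceil |V|/n \rceil$ subgraphs, so for $n \le |V|$ its cost is bounded by
\begin{equation*}
\left\lceil \frac{|V|}{n} \right\rceil \cdot O(n^3) = O(|V| n^2).
\end{equation*}
For instance, with the subgraph size $n = 800$ used for ACM in Appendix \ref{app:node_clustering_setting}, each subgraph contributes on the order of at most $800^3 = 5.12 \times 10^8$ elementary steps, up to constant factors and independently of $|V|$.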

\begin{algorithm}[t]
%\small
\caption{Hierarchical 2D MrSE minimization.}\label{algorithm:hierarchical_2D_MrSE}
\KwIn{Multi-relational graph $G' = (V, E', R)$, subgraph size $n$}
\KwOut{An optimal encoding tree $\mathcal{T}$ of height 2}
%Acquire the node and relation stationary distributions $\textbf{x}'$ and $\textbf{y}$ from multi-relational random surfing on $G'$

Initialize $\mathcal{T}$, s.t. for each node $v \in V$, add two nodes, i.e., $\alpha$ and ${\alpha}^-$, to $\mathcal{T}$.  $\alpha$ is a leaf node of $\mathcal{T}$ and $T_\alpha = \{v\}$. ${\alpha}^-$ is the parent of $\alpha$ and $h({\alpha}^-) = 1$

\While{True}{
    $\mathcal{P} \gets (\alpha|\alpha \in \mathcal{T}, h(\alpha) = 1)$

    $\{\mathcal{P}_{g}\} \gets $ consecutively remove the first $\min(n, \text{size of the remaining part of } \mathcal{P})$ clusters from $\mathcal{P}$ that form a set $\mathcal{P}_{g}$

    \For{$\mathcal{P}_{g} \in \{\mathcal{P}_{g}\}$}{// minimize the MrSE of one subgraph
    
        $V_{g} \gets$ all graph nodes $v \in V$ that are associated with the clusters in $\mathcal{P}_{g}$
        
        $E_{g}^{'} \gets \{e \in E' \mid \text{both endpoints of } e \text{ are in } V_{g}\}$
        
        ${G}_{g}^{'} \gets (V_{g}, E_{g}^{'}, R)$

        $\mathcal{T}_{g} \gets$ construct a new encoding tree that contains $\mathcal{P}_{g}$ and the leaf tree nodes of $\mathcal{T}$ that are associated with $\mathcal{P}_{g}$

        $\mathcal{T}_{g}^{'} \gets$ run 2D MrSE minimization (Algorithm \ref{algorithm:2D_MrSE}) on ${G}_{g}^{'}$, with the initial encoding tree set to $\mathcal{T}_{g}$

        $\mathcal{P}_{g}^{'} \gets (\alpha|\alpha \in \mathcal{T}_{g}^{'}, h(\alpha) = 1)$

        Update $\mathcal{T}$ with $\mathcal{P}_{g}^{'}$

    }

    \If{$|\{\mathcal{P}_{g}\}| = 1$}{
        Break
    }
    \If{$\mathcal{P}$ is the same as at the end of the last iteration}{
        $n \gets 2n$
    }
}

\Return $\mathcal{T}$
\end{algorithm}


\section{Derivation of Equation (\ref{eq:merge_delta_MrSE})}
\label{app:derivation}
%%=======================
Based on Equation (\ref{eq:MrSE_kD_RS}), merging $\alpha_{o_1}$ and $\alpha_{o_2}$ into $\alpha_{n}$ only affects the MrSE values associated with $\alpha_{o_1}$, $\alpha_{o_2}$, $\alpha_{n}$, and their children. We denote the children of $\alpha_{o_1}$, $\alpha_{o_2}$, and $\alpha_{n}$ as $\Gamma_1 = \{\gamma|\gamma \in \mathcal{T}, \gamma^- = \alpha_{o_1}\}$, $\Gamma_2 = \{\gamma|\gamma \in \mathcal{T}, \gamma^- = \alpha_{o_2}\}$, and $\Gamma_3 = \{\gamma|\gamma \in \mathcal{T}, \gamma^- = \alpha_{n}\} = \Gamma_1 \cup \Gamma_2$, respectively. We have
\begin{equation}
\label{eq:merge_delta_MrSE_de1}
\begin{split}
& \Delta\text{MrSE}_{\alpha_{o_1}, \alpha_{o_2}} = \text{MrSE}_{new} -\text{MrSE}_{old} \\
& = - p'_{\rightarrow\alpha_n}\log{p'_{\alpha_n}} \underbrace{-\underset{\gamma \in \Gamma_3}{\sum}\textbf{x}'_{T_\gamma}\log\frac{\textbf{x}'_{T_\gamma}}{p'_{\alpha_n}}}_{\tcircle{1}}\\
& + p'_{\rightarrow\alpha_{o_1}}\log{p'_{\alpha_{o_1}}} \underbrace{+\underset{\gamma \in \Gamma_1}{\sum}\textbf{x}'_{T_\gamma}\log\frac{\textbf{x}'_{T_\gamma}}{p'_{\alpha_{o_1}}}}_{\tcircle{2}}\\
& + p'_{\rightarrow\alpha_{o_2}}\log{p'_{\alpha_{o_2}}} \underbrace{+\underset{\gamma \in \Gamma_2}{\sum}\textbf{x}'_{T_\gamma}\log\frac{\textbf{x}'_{T_\gamma}}{p'_{\alpha_{o_2}}}}_{\tcircle{3}}.\\
\end{split}
\end{equation}

Further, we have

\begin{equation}
\label{eq:merge_delta_MrSE_de2}
\begin{split}
\tcircle{1} + \tcircle{2} + \tcircle{3} & = - p'_{\alpha_{o_1}}\log\frac{p'_{\alpha_{o_1}}}{p'_{\alpha_n}} - p'_{\alpha_{o_2}}\log\frac{p'_{\alpha_{o_2}}}{p'_{\alpha_n}}.
\end{split}
\end{equation}
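The step from Equation (\ref{eq:merge_delta_MrSE_de1}) to Equation (\ref{eq:merge_delta_MrSE_de2}) can be spelled out as follows. This sketch relies on $\Gamma_3 = \Gamma_1 \cup \Gamma_2$ with $\Gamma_1 \cap \Gamma_2 = \emptyset$ and, as assumed here from the meaning of $p'_{\alpha}$ as the probability of being in $T_\alpha$, on $p'_{\alpha_{o_i}} = \sum_{\gamma \in \Gamma_i} \textbf{x}'_{T_\gamma}$ for $i \in \{1, 2\}$.
\begin{equation*}
\begin{split}
\tcircle{1} + \tcircle{2} + \tcircle{3} & = \sum_{i=1}^{2} \sum_{\gamma \in \Gamma_i} \textbf{x}'_{T_\gamma} \left( \log\frac{\textbf{x}'_{T_\gamma}}{p'_{\alpha_{o_i}}} - \log\frac{\textbf{x}'_{T_\gamma}}{p'_{\alpha_n}} \right) \\
& = \sum_{i=1}^{2} \log\frac{p'_{\alpha_n}}{p'_{\alpha_{o_i}}} \sum_{\gamma \in \Gamma_i} \textbf{x}'_{T_\gamma} \\
& = - \sum_{i=1}^{2} p'_{\alpha_{o_i}} \log\frac{p'_{\alpha_{o_i}}}{p'_{\alpha_n}}.
\end{split}
\end{equation*}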

Plugging Equation (\ref{eq:merge_delta_MrSE_de2}) into Equation (\ref{eq:merge_delta_MrSE_de1}) concludes the derivation of Equation (\ref{eq:merge_delta_MrSE}).
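
Written out explicitly, the substitution gives the expression below, which should coincide with Equation (\ref{eq:merge_delta_MrSE}) up to a rearrangement of terms:
\begin{equation*}
\begin{split}
\Delta\text{MrSE}_{\alpha_{o_1}, \alpha_{o_2}} & = - p'_{\rightarrow\alpha_n}\log{p'_{\alpha_n}} + p'_{\rightarrow\alpha_{o_1}}\log{p'_{\alpha_{o_1}}} + p'_{\rightarrow\alpha_{o_2}}\log{p'_{\alpha_{o_2}}} \\
& \quad - p'_{\alpha_{o_1}}\log\frac{p'_{\alpha_{o_1}}}{p'_{\alpha_n}} - p'_{\alpha_{o_2}}\log\frac{p'_{\alpha_{o_2}}}{p'_{\alpha_n}}.
\end{split}
\end{equation*}
Since $\Delta\text{MrSE}_{\alpha_{o_1}, \alpha_{o_2}} = \text{MrSE}_{new} - \text{MrSE}_{old}$, a merge lowers the MrSE exactly when this quantity is negative.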

%%=======================

\section{Node Clustering Experimental Setting}
\label{app:node_clustering_setting}
We use an embedding dimension of 64 for all embedding-based methods. For SpectralNet, we adopt a three-layer architecture of [512, 256, 64]. For node2vec and metapath2vec, we set the walk length to 100, the context size to 7, the number of walks per node to 5, the number of negative samples to 5, and the number of workers to 6. For all deep learning-based methods, we set the learning rate to 0.001 and the number of training epochs to 200, with early stopping at a patience of 50 epochs. 
Given the high density of the ACM and DBLP datasets, we adopt hierarchical 2D minimization (Algorithm \ref{algorithm:hierarchical_2D_MrSE}) for MrSE, which is faster than the standard 2D minimization (Algorithm \ref{algorithm:2D_MrSE}) on dense graphs. Similarly, for SE and RSSE, we apply the hierarchical 2D minimization proposed by \citet{cao2023hierarchical} instead of the vanilla 2D minimization of \citet{li2016structural} on the ACM and DBLP datasets. We set the subgraph size $n$ to 800 and 100 for the ACM and DBLP datasets, respectively. 


\section{Node Clustering ACC}
\label{app:node_clustering_acc}
Table \ref{table:node_clustering_acc} presents the node clustering ACC. Our proposed MrSE scores the highest on the ACM and DBLP datasets, outperforming strong baselines, including those that leverage node features in addition to graph structure. MrSE performs poorly on the extremely sparse IMDB dataset, likely due to the loss of structural information resulting from the stochasticity adjustments. Furthermore, MrSE is more effective than SE at deciphering the community structures of multi-relational graphs: it outperforms SE on two of the three datasets and performs comparably on the third. Meanwhile, some methods with relatively low NMI and ARI achieve high ACC. We believe that this is related to how the expected number of clusters is set for these methods. Specifically, as discussed in Section \ref{sec:node_clustering}, for representation-learning models including node2vec,
metapath2vec, DGI, and DMGI, we perform k-means clustering after learning representations, setting the number of clusters to the ground-truth number to obtain the final community predictions. We observe that the ACC scores of these methods are sensitive to the number of clusters, while the NMI scores are relatively more stable. For instance, on the ACM dataset, changing the number of clusters to 50 causes the ACC of DGI ($G'$, $X$) to drop from 71.57\% to 13.42\% and its NMI from 47.03\% to 34.74\%. Similarly, the ACC of metapath2vec decreases from 69.72\% to 18.94\%, and its NMI from 47.51\% to 29.13\%.

\begin{table}[t]
\caption{Node clustering ACC (\%). `/' indicates that SpectralNet fails to run on ACM.}
\centering
\scriptsize
\begin{adjustbox}{width=0.85\linewidth}
    \begin{tabular}{c|c|c|c|c}
    \toprule
    %\hline
    Method & Input & ACM & DBLP & IMDB\\
    %\hline
    \midrule
    k-means & $X$ & 31.97 & 28.52 & 28.94 \\
    %\hline
    \midrule
    DMoN & $G$, $X$ & 35.07 & 6.48  & 3.62 \\
    DGI & $G$, $X$ & 35.37 & 50.70 & \textbf{48.50} \\
    %\hline
    \midrule
    SpectralNet & $G$ & / & 11.98  & 19.10 \\
    node2vec & $G$ & 35.21 & 48.73  & 42.24 \\
    SE & $G$ & 44.29 & \underline{68.66} & 6.57 \\
    RSSE (ours) & $G$ & 44.33  & 68.05   & 6.59 \\
    %\hline
    \midrule
    DMoN & $G'$, $X$ & 16.73 & 17.18 & 9.16 \\
    DGI & $G'$, $X$ & \underline{71.57} & 54.47  & 35.25 \\
    DMGI & $G'$, $X$ & 55.64 & 31.23 & 37.82 \\
    %\hline
    \midrule
    SpectralNet & $G'$ & / & 45.45  & 5.80 \\
    metapath2vec & $G'$ & 69.72 & 66.84 & \underline{44.09} \\
    MrSE (ours) & $G'$ & \textbf{77.72} & \textbf{72.70} & 6.81 \\
    %\hline
    \bottomrule
    \end{tabular}
\end{adjustbox}
  \label{table:node_clustering_acc}
\end{table}


\section{Social Event Detection Data Statistics}
\label{app:social_event_data}
Table \ref{table:social_event_data} shows the statistics of the social event detection data. Given that all compared methods are unsupervised and exclusively utilize the test data, we limit our presentation of statistics to the test sets in Table \ref{table:social_event_data}.

\begin{table}[t]
\caption{Statistics of the social event detection datasets. M, U, UM, H, N, and S denote message, sender, user mention, hashtag, named entity, and semantic, respectively. `combine' denotes the single-relational graph whose edges are reduced from the multi-relational ones. Sparsity is reported in percent. All statistics are averages. Detailed data splits can be found in \citet{cao2023hierarchical}.}
\centering
\scriptsize
\begin{adjustbox}{width=0.95\linewidth}
    \begin{tabular}{c|ccccc}
    \toprule
    %\hline
    Dataset & $|G'|$ & $|V|$ & $R$ & Sparsity (\%) & $|Y|$ \\
    \midrule
    \multirow{6}{*}{\makecell{Event2012 \\ (avg.)}} & \multirow{6}{*}{21} & \multirow{6}{*}{2,314} & M-U-M & $>$99.99 & \multirow{6}{*}{37} \\
    &  & & M-UM-M & 99.85 &  \\
    &  & & M-H-M & 99.60 &  \\
    &  & & M-N-M & 96.17 &  \\
    &  & & M-S-M & 97.32 &  \\
    &  & & combine & 93.98 &  \\
    \midrule
    \multirow{6}{*}{\makecell{Event2018 \\ (avg.)}} & \multirow{6}{*}{16} & \multirow{6}{*}{3,137} & M-U-M & 99.88 & \multirow{6}{*}{25} \\
    &  & & M-UM-M & 99.45 &  \\
    &  & & M-H-M & 98.05 &  \\
    &  & & M-N-M & 98.05 &  \\
    &  & & M-S-M & 98.79 &  \\
    &  & & combine & 94.76 &  \\
    \bottomrule
    \end{tabular}
\end{adjustbox}
  \label{table:social_event_data}
\end{table}

\section{Multi-relational Message Graph Construction}
\label{app:social_event_graph}
We create multi-relational message graphs by distinguishing message correlations stemming from shared senders, mentioned users, hashtags, named entities, and similar natural language semantics. To achieve this, individual single-relational graphs are constructed for each correlation type. Additionally, a combined graph is formed by consolidating various correlation types into a unified representation. In this consolidation, multiple edges of different correlation types between the same pair of nodes are reduced into a single edge. Following this, all individual single-relational graphs are concatenated along the relation axis, completing the construction of a multi-relational message graph.
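
In tensor form, the construction can be sketched as follows. The sketch assumes binary (unweighted) single-relational adjacency matrices; the symbols $\textbf{A}^{(k)}$, for the adjacency matrix of the $k$-th correlation type, and $\textbf{A}^{\mathrm{combine}}$, for the combined graph, are introduced here for illustration only.
\begin{equation*}
\textbf{A}'_{i,j,k} = \textbf{A}^{(k)}_{i,j}, \qquad \textbf{A}^{\mathrm{combine}}_{i,j} = \max_{k \in \{1, \dots, |R|\}} \textbf{A}'_{i,j,k}.
\end{equation*}
Under this reading, two messages are connected in the combined graph if and only if they are connected through at least one correlation type, regardless of how many types they share.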


\end{document}
