% \documentclass{uai2022} % for initial submission
\documentclass[accepted]{uai2022} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2022} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2022} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Knowledge Representation Combining Quaternion Path Integration
and Depth-wise Atrous Circular Convolution
}

% The standard author block has changed for UAI 2022 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1,2]{\href{mailto:<chen.xinyuan@s.unikl.edu.my>?Subject=Your UAI 2022 paper}{Xinyuan~Chen}{}}
\author[1]{Zhongmei~Zhou}
\author[1]{Meichun~Gao}
\author[1]{Daya~Shi}
\author[2]{Mohd~Nizam~Husen}

% Add affiliations after the authors
\affil[1]{%
    School of Technology\\
    Fuzhou Technology and Business University\\
    Fuzhou, Fujian, China
}
\affil[2]{%
    Malaysian Institute of Information Technology\\
    Universiti Kuala Lumpur\\
    Kuala Lumpur, Malaysia
}
  
  \begin{document}
\maketitle

\begin{abstract}
  Knowledge representation models aim to improve representation and feature extraction capabilities while keeping computational cost low. First, existing embedding models in the hypercomplex spaces of non-Abelian groups are optimized. Then a fast quaternion multiplication method is proposed with proof; it is used to compute path semantics, which are further integrated through the attention mechanism, based on the idea that semantic extraction from relation sequences can be regarded as a multiple rotational blending problem. A depth-wise atrous circular convolution framework is set up for better feature extraction. Link Prediction and Path Query experiments on benchmark datasets indicate that our model holds advantages over state-of-the-art models such as Rotate3D. Moreover, the model is tested on a biomedical dataset that simulates real-world applications. An ablation study is also performed to explore the effectiveness of the different components.
\end{abstract}

\section{INTRODUCTION}\label{sec:intro}

A Knowledge Graph (KG) is composed of structured fact triples. Entities in the triples are represented as nodes in the graph, and the relations between head and tail entities are represented as edges connecting the nodes. KGs are widely applied in areas such as question answering \citep{hao2017end} and personalized recommendation \citep{guo2020survey}. However, existing KGs are incomplete and contain noise. One remedy is to embed entities and relations into low-dimensional vector spaces and apply KG completion techniques to predict missing facts. For example, RotatE \citep{sun2019rotate} maps relations to rotations and uses the distance between the rotated head vector and the tail vector to determine whether a triple is true. However, different KGs contain various proportions of multiple relation modes, including Symmetry, Anti-symmetry, Inversion and Composition. Different models are capable of learning representations for different modes, and so far no embedding solution handles all of them equally well.

Most existing embedding models learn representations in the two smallest division algebras, $\boldsymbol R$ and $\boldsymbol C$. With the quaternion algebra $\boldsymbol H$ and the octonion algebra $\boldsymbol O$, models can achieve higher expressivity with fewer parameters; moreover, the non-commutativity of non-Abelian groups helps to model the compositional relation mode. Three-dimensional (3D) and four-dimensional (4D) spatial embeddings with quaternions are adopted in Rotate3D \citep{gao2020rotate3d} and QuatE \citep{zhang2019quaternion} respectively. Subsequent models enhance expressivity by adding entity- or relation-specific quaternions or by increasing embedding dimensions; their larger parameter scales and limited feature extraction capabilities leave room for improvement.

Semantic information carried by relation paths between entity pairs helps to determine the validity of triples in knowledge inference. Many models employ frameworks such as the Recurrent Neural Network (RNN) \citep{jozefowicz2015empirical}, Long Short-Term Memory (LSTM) \citep{greff2016lstm,zhou2016attention} and the Gated Recurrent Unit (GRU) \citep{lu2017bidirectional} to merge vector sequences, but their computational efficiency could be further improved.

ConvE \citep{dettmers2018convolutional} performs 2-dimensional (2D) reshaping on concatenation matrices to enhance interactions and extracts deep non-linear features with a Convolutional Neural Network (CNN). InteractE \citep{vashishth2020interacte} adopts an optimized reshaping strategy as well as circular convolution. To capture rich features of complex relations, and inspired by neurons with receptive fields of different sizes, Atrous Convolution \citep{chen2017deeplab} expands the receptive fields for larger interaction spaces while maintaining parameter scales. Moreover, introducing the attention mechanism to integrate features extracted by kernels of various sizes helps to stabilize model performance. The strategies above are jointly applied in our study.

APAC (A Knowledge Representation Model based on Non-Abelian Groups, Path Semantics and Depth-wise Atrous Circular Convolution) is proposed. The main contributions are as follows:

1. A hypercomplex embedding model with improved score function and loss function designs is put forward based on state-of-the-art (SOTA) quaternion models, in which the quaternion algebra, a Hamiltonian group of the smallest order, is employed to learn multiple relational modes between entities. The embedding is also extended to the octonion space.

2. Based on the idea of multi-hop reasoning in Rotate3D, a fast multiplication method for quaternion sequences is proposed with proof to rapidly merge the features of relational paths, which are then integrated with the attention mechanism.

3. A depth-wise atrous circular convolution framework is set up to enhance the feature extraction capability.

4. Experiments including Link Prediction and Path Query are carried out on benchmark and industry datasets to verify model effectiveness. An ablation study is further performed (see supplementary materials).

\section{RELATED WORK}\label{sec:RelatedWork}
Embedding models could be roughly divided into translation/rotation-based distance models and similarity-based semantic models.

TransE \citep{bordes2013translating}, a distance model, maps the relations to translation vectors. TransE holds that if a triple is valid, the head vector after translation should be close to the tail, denoted as
\begin{equation}
\boldsymbol{h} + \boldsymbol{r} \approx \boldsymbol{t},
\end{equation}
where $\boldsymbol{h},\boldsymbol{r},\boldsymbol{t}$ are the vector representations of the head entity, the relation and the tail entity respectively. The L1 or L2 distance between the vectors is taken as the score of the triple, and a margin-based loss function is applied. Despite its simple structure, TransE achieves strong performance; however, it cannot learn symmetric relational representations. Most subsequent models improve on it by adding dimensions or expanding mapping spaces \citep{wang2014knowledge,lin2015learning}, followed by initiatives employing sparse matrix decomposition \citep{ji2016knowledge} to reduce the number of parameters. RotatE models relations as 2D rotations from head to tail entities in complex spaces with the Hadamard product and normalization constraints. This calculation is commutative. Therefore, RotatE may not perform well on non-commutative relational modes (e.g., Adam's father's wife is not Adam's wife's father).
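
As a minimal illustration of this limitation (a sketch in our own notation, with $\theta_1$ and $\theta_2$ as hypothetical rotation angles), composing two complex rotations gives the same result in either order, whereas the standard quaternion units $\mathrm{i},\mathrm{j},\mathrm{k}$ do not commute:
\begin{gather*}
e^{\mathrm{i}\theta_1} e^{\mathrm{i}\theta_2} = e^{\mathrm{i}(\theta_1+\theta_2)} = e^{\mathrm{i}\theta_2} e^{\mathrm{i}\theta_1}, \\
\mathrm{i}\,\mathrm{j} = \mathrm{k}, \qquad \mathrm{j}\,\mathrm{i} = -\mathrm{k}.
\end{gather*}
A relation path such as father-then-wife therefore cannot be told apart from wife-then-father by element-wise complex rotations, while a quaternion product can in principle distinguish the two orders.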

RESCAL \citep{nickel2015review}, an early semantic model, scores triples through the factorization of third-order adjacency tensors. RESCAL has strong expressivity, but its high complexity makes it difficult to train. DistMult \citep{yang2014embedding} represents relations as diagonal matrices to simplify calculations and is capable of learning the symmetric and inverse modes. ComplEx \citep{trouillon2016complex} further extends the embedding to complex spaces to enhance the learning ability for anti-symmetric patterns. The Hermitian product is used to calculate triple scores and reduce the number of parameters; however, it is still difficult for ComplEx to learn the non-commutative pattern. \citet{lacroix2018canonical} upgrade ComplEx with N3 (nuclear 3-norm) regularization and a multi-class log loss.

Since the hypercomplex spaces of non-Abelian groups offer strong expressivity and quaternion / octonion calculations are rather efficient, several recent models expand their mapping spaces with them. Hyperbolic spaces are also taken into consideration \citep{chami2020low}.

QuatE represents relations as rotations in 4D spaces with quaternions to provide more degrees of freedom while avoiding gimbal lock. QuatE first calculates the Hamilton product between the head quaternion $\boldsymbol{Q_h}$ and the unit relation quaternion $\boldsymbol{\hat { W } _ { r }}$, and then takes the inner product of the result with the tail quaternion $\boldsymbol{Q_t}$ to obtain the triple score. Compared with real/complex space models, QuatE can also learn symmetric (set the coefficients of imaginary parts to 0), anti-symmetric (conjugate quaternion), and inverse (coefficients set to $-1$) relation modes while enjoying larger spaces, fewer parameters and lower computational cost. On this basis, QuatDE \citep{gao2021quatde}, QuatRE \citep{nguyen2020quatre} and DualE \citep{cao2021dual} further enhance expressivity by increasing dimensions or adding quaternions, at the cost of model scalability.

Most models above ignore the rich semantic information contained in relation paths \citep{wang2016knowledge}. \citet{lao2011random} generate paths with the Random Walk algorithm and verify the value of paths in knowledge inference. However, early studies take paths as atomic features, leading to huge feature matrices \citep{shang2019end}. \citet{neelakantan2015compositional} and \citet{das2016chains} decompose the paths into relation sequences and feed them into an RNN, reducing computational cost through parameter sharing. Nonetheless, the possibility that multiple paths are associated with the candidate relations to different extents is ignored \citep{xie2017interpretable}. To solve this, \citet{jiang2017attentive} introduce the attention mechanism. Rotate3D models path-based multi-hop reasoning as multiple rotations, and in our study a calculation method for fast rotation blending and integration is proposed.

Besides expressivity and computational overhead, another vital indicator of knowledge representation models is feature extraction capability. CNNs have far fewer parameters than fully connected neural networks and have been widely employed in Natural Language Processing in recent years. Compared with distance models, the 2D convolution in ConvE is able to enhance interactions between entities and relations and extract richer features for embedding learning \citep{balavzevic2019hypernetwork}. However, local features are partly lost since ConvE leaves out translational/rotational attributes. \citet{vashishth2020interacte} argue that both the distance and semantic models can only capture shallow features, so they propose Checkered Reshaping and Circular Convolution to improve interactions. On this basis, \citet{wang2021atrous} suggest that multi-size atrous convolution combined with the attention mechanism could bring similar effects.

Therefore, in our model hypercomplex embedding is employed with optimization. Path semantics are extracted and integrated by fast rotational blending calculation and the attention mechanism. Also a depth-wise atrous circular convolution is defined to facilitate feature extraction.

\section{APAC FRAMEWORK}\label{sec:APAC}
As shown in Figure~\ref{fig:frame}, APAC learns entity / relation embeddings in hypercomplex spaces. A fast relational path multiplication calculation is designed and the attention mechanism is introduced to integrate path semantics. The feature extraction capability is further enhanced with the depth-wise atrous circular convolution.
\begin{figure}
  \centering
  \includegraphics[width=0.9\linewidth,page=3]{quacon3}
  \caption{Framework of APAC.}\label{fig:frame}
\end{figure}
\subsection{Hypercomplex Embedding}
Representing relations as rotations of 3D vectors in 3D subspaces of 4D spaces with quaternion embedding can effectively model multiple relational modes. QuatE's inner product score function is mostly used for logistic regression problems, while distance-based score functions with the L1 / L2 norm and margin-based normalized loss functions perform better in the presence of noise. Therefore, the Hamilton product between the head $\boldsymbol{Q_h}$ and the relation $\boldsymbol{r}$ is calculated first, and then the distance between the result and the tail $\boldsymbol{Q_t}$ is computed. The score function for quaternions is denoted as
\begin{equation}
 \phi ( \boldsymbol { h } , \boldsymbol { r } , \boldsymbol { t } ) = \| \boldsymbol { h } \otimes \boldsymbol { r } - \boldsymbol { t } \|.
\end{equation}
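As a reminder of what $\otimes$ computes, the standard Hamilton product can be written out for a single quaternion coordinate. The component names below are our own notation, assumed for illustration: $\boldsymbol{h} = a_h + b_h\mathrm{i} + c_h\mathrm{j} + d_h\mathrm{k}$ and $\boldsymbol{r} = a_r + b_r\mathrm{i} + c_r\mathrm{j} + d_r\mathrm{k}$, with the product presumably applied coordinate-wise across the embedding dimension as in QuatE:
\begin{align*}
\boldsymbol{h} \otimes \boldsymbol{r} = {} & (a_h a_r - b_h b_r - c_h c_r - d_h d_r) \\
 & + (a_h b_r + b_h a_r + c_h d_r - d_h c_r)\,\mathrm{i} \\
 & + (a_h c_r - b_h d_r + c_h a_r + d_h b_r)\,\mathrm{j} \\
 & + (a_h d_r + b_h c_r - c_h b_r + d_h a_r)\,\mathrm{k}.
\end{align*}
Swapping the operands flips the sign of the cross terms (for example $c_h d_r - d_h c_r$ in the $\mathrm{i}$ component), which is where the non-commutativity comes from.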

Accordingly, the loss function is defined as
\begin{equation}
\begin{aligned}
L = {} & - \log \sigma \left( \gamma - \phi ( \boldsymbol{h} , \boldsymbol{r} , \boldsymbol{t} ) \right) \\
& - \sum_{i=1}^{m} p \left( \boldsymbol{h}_{i}^{\prime} , \boldsymbol{r} , \boldsymbol{t}_{i}^{\prime} \right) \log \sigma \left( \phi \left( \boldsymbol{h}_{i}^{\prime} , \boldsymbol{r} , \boldsymbol{t}_{i}^{\prime} \right) - \gamma \right) \\
& + \lambda_{1} \| \boldsymbol{Q} \|_{2}^{2} + \lambda_{2} \| \boldsymbol{R} \|_{2}^{2},
\end{aligned}
\end{equation}
following the self-adversarial negative sampling in RotatE, where $\sigma$ is the Sigmoid function, $\gamma$ is the margin with a slack coefficient \citep{nayyeri2020let}, $\left( \boldsymbol{h}_{i}^{\prime} , \boldsymbol{r} , \boldsymbol{t}_{i}^{\prime} \right)$ refers to the $i$th invalid triple, $m$ is the total number of invalid triples, $\lambda_{1}$ and $\lambda_{2}$ are the coefficients of the L2 norm constraints on the entity embeddings $\boldsymbol{Q}$ and the relation embeddings $\boldsymbol{R}$, $p \left( \boldsymbol{h}_{i}^{\prime} , \boldsymbol{r} , \boldsymbol{t}_{i}^{\prime} \right) = \frac{ \exp \beta \mathrm{f} \left( \boldsymbol{h}_{i}^{\prime} , \boldsymbol{t}_{i}^{\prime} \right) }{ \sum_{j=1}^{m} \exp \beta \mathrm{f} \left( \boldsymbol{h}_{j}^{\prime} , \boldsymbol{t}_{j}^{\prime} \right) }$ is the probability distribution of negative sampling, $\beta$ is the sampling temperature, and $\mathrm{f} \left( \boldsymbol{h}_{i}^{\prime} , \boldsymbol{t}_{i}^{\prime} \right) = - \phi \left( \boldsymbol{h}_{i}^{\prime} , \boldsymbol{r} , \boldsymbol{t}_{i}^{\prime} \right)$.

The score function and the loss function for octonions are defined analogously.

Since overall constraints are already imposed on $\boldsymbol { Q }$ and $\boldsymbol { R }$, the unit constraint on the relational quaternions appears unnecessary: it compresses the entity rotation spaces and weakens expressivity. L1 normalization helps to generate sparse representations and retain only key features so as to reduce noise interference, but information on valuable channels may be discarded. In our experiments the L2 constraint performs better.

\subsection{Extraction and Integration of Path Semantics}
Following the path query solution proposed by \citet{gao2020rotate3d} based on multi-hop reasoning, our study regards multi-hop reasoning as successive Hamilton products of relational sequences, taking advantage of the non-commutativity of quaternions. For this calculation, a fast path feature extraction and integration method is proposed.

First, multiple paths are generated between entities with Random Walk and encoded as quaternion relation sequences. The length of a path is defined as the number of relations in the path. Paths of different lengths are allowed, subject to an upper threshold. Path feature extraction can be regarded as multiple rotational blending. The operation space for rotations is non-linear, so the rotational quaternions cannot simply be added together.

{\bfseries Theorem 1}: For a quaternion sequence $\boldsymbol { q } _ { 1 } , \boldsymbol { q } _ { 2 } , \ldots , \boldsymbol { q } _ { i } , \ldots , \boldsymbol { q } _ { n }$, $i = 1,2 , \ldots , n$, the continual Hamilton product can be calculated as
\begin{equation}
\boldsymbol { q } _ { \text {result } } = e ^ { \sum _ { i = 1 } ^ { n } \log q _ { i } ^ { \prime k } } = e ^ { \sum _ { i = 1 } ^ { n } k \log q _ { i } ^ { \prime } }.
\end{equation}
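
To make the logarithm and exponential in Theorem 1 concrete, recall the standard maps for a unit quaternion. We assume these are the maps intended here; $\boldsymbol{q}_{i}^{\prime}$ and $k$ are defined in Appendix A and are not restated. Writing a unit quaternion as $\boldsymbol{q} = \cos\theta + \boldsymbol{u}\sin\theta$, where $\boldsymbol{u}$ is a pure imaginary unit quaternion ($\boldsymbol{u}^{2} = -1$) and $\theta$ is an angle in our own illustrative notation, and taking $k$ to be a real exponent,
\begin{gather*}
\log \boldsymbol{q} = \theta\,\boldsymbol{u}, \qquad e^{\theta \boldsymbol{u}} = \cos\theta + \boldsymbol{u}\sin\theta, \\
\boldsymbol{q}^{k} = e^{k \log \boldsymbol{q}} = \cos(k\theta) + \boldsymbol{u}\sin(k\theta).
\end{gather*}
The second line is the identity behind the last equality of Theorem 1: raising a rotation to the power $k$ scales its angle by $k$ and leaves its axis unchanged.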

Please see Appendix A for the proof and an illustration. The score of the $\mathrm{i}$th path is denoted as
\begin{equation}
\Psi _ { \mathrm { i } } ( \boldsymbol { h } , \boldsymbol { t } ) = \left\| \boldsymbol { h } \otimes \boldsymbol { r } _ { 1 } \otimes \boldsymbol { r } _ { 2 } \otimes \ldots \otimes \boldsymbol { r } _ { n } - \boldsymbol { t } \right\| _ { p } , p = 1,2.
\end{equation}

In order to reduce noise and extract key features, the attention mechanism is introduced to integrate path representations, denoted as
\begin{equation}
\Psi _ { \mathrm { i } } ^ { \prime } ( \boldsymbol { h } , \boldsymbol { t } ) = \frac { \exp \left( \Psi _ { \mathrm { i } } ( \boldsymbol { h } , \boldsymbol { t } ) \right) } { \sum _ { \mathrm { j } \in s } \exp \left( \Psi _ { \mathrm { j } } ( \boldsymbol { h } , \boldsymbol { t } ) \right) },
\end{equation}
in which $s$ is the path set; that is, the Softmax function is employed to normalize the path scores. The structured representation of the tail entity derived from path semantics is
\begin{equation}
\boldsymbol { W } _ { t } = \sum _ { \mathrm { i } \in s } \Psi _ { \mathrm { i } } ^ { \prime } ( \boldsymbol { h } , \boldsymbol { t } ) \circ \boldsymbol { h }.
\end{equation}
\subsection{Depth-wise Atrous Circular Convolution}
In this part the checkered reshaping and the depth-wise atrous circular convolution are combined to improve the model's capability in feature extraction.

The reshaping function is defined as $\pi : \boldsymbol { H } ^ { k } \times \boldsymbol { H } ^ { k } \rightarrow \boldsymbol { H } ^ { m \times n }$, where $m \times n = 2 k$. The comparison of stacked, alternate and checkered reshaping is shown in Figure~\ref{fig:reshaping}.
\begin{figure}
  \centering
  \includegraphics[width=0.7\linewidth,page=5]{check2}
  \caption{Stacked (left), Alternate (middle) and Checkered Reshaping (right).}\label{fig:reshaping}
\end{figure}
\citet{vashishth2020interacte} argue that entity/relation interactions can be divided into two types, heterogeneous and homogeneous, whose numbers are denoted as $\mathcal { N } _ { \text{het} } ( \pi , k )$ and $\mathcal { N } _ { \text{homo} } ( \pi , k )$ respectively, with $\mathcal { N } _ { \text{het} } ( \pi , k ) + \mathcal { N } _ { \text{homo} } ( \pi , k ) = 2 \binom { k ^ { 2 } } { 2 }$. Of the two, the heterogeneous interactions are of greater value in exploring entity/relation associations. They prove that the proportion of $\mathcal { N } _ { \text{het} } ( \pi , k )$ is the highest with the checkered reshaping. Therefore, this strategy is applied.

A comparison between the ordinary convolution and the circular convolution is shown in Figure~\ref{fig:circular}.
\begin{figure}
  \centering
  \includegraphics[width=0.7\linewidth,page=5]{cir2}
  \caption{Ordinary Convolution (left) and Circular Convolution (right).}\label{fig:circular}
\end{figure}
\citet{vashishth2020interacte} argue that
$\mathcal { N } _ { \text{het} } \left( \Omega _ { c } ( \pi ) , k \right) \geq \mathcal { N } _ { \text{het} } \left( \Omega _ { 0 } ( \pi ) , k \right)$,
where $\Omega _ { c }$ is the circular convolution and $\Omega _ { 0 }$ is the ordinary convolution. The former is employed in our study, defined as
\begin{equation}
[ \boldsymbol { I } \star \boldsymbol { \omega } ] _ { u , t } = \sum _ { i = - \lfloor p / 2 \rfloor } ^ { \lfloor p / 2 \rfloor } \sum _ { j = - \lfloor p / 2 \rfloor } ^ { \lfloor p / 2 \rfloor } \boldsymbol { I } _ { [ u - i ] _ { m } , [ t - j ] _ { n } } \boldsymbol { \omega } _ { i , j },
\end{equation}
where $\boldsymbol { I } \in \boldsymbol { H } ^ { m \times n }$, $\boldsymbol { \omega } \in \boldsymbol { H } ^ { p \times p }$, and $\lfloor \cdot \rfloor$ is the floor function. The depth-wise convolution extracts feature information channel by channel before merging.
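
As a small worked case of the wrap-around indexing, assume that $[x]_{m}$ denotes $x$ modulo $m$ with values in $\{0,\ldots,m-1\}$ and that rows are indexed from 0 (both are our assumptions, and the sizes are hypothetical). For $m=4$, $p=3$ and output row $u=0$, the kernel offsets $i \in \{-1,0,1\}$ read the input rows
\begin{equation*}
[0-(-1)]_{4} = 1, \qquad [0-0]_{4} = 0, \qquad [0-1]_{4} = 3,
\end{equation*}
so a kernel centered on the first row also covers the last row instead of zero padding, which is what distinguishes circular from ordinary convolution.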

\citet{wang2021atrous} suggest that while single-size kernels benefit from parameter sharing and low computing overhead, their receptive fields are limited. In contrast, multi-size circular convolution with the attention mechanism can better extract critical features, so atrous convolution is employed in this study, with the equivalent kernel size defined as
\begin{equation}
p ^ { \prime } = p + ( p - 1 ) ( \alpha - 1 ),
\end{equation}
where $p$ is the size of a standard kernel and $\alpha$ is the void (dilation) rate. Holes are filled with 0, so the receptive field is enlarged with the same number of parameters and the same computational cost. Convolution kernels with different void rates are shown in Figure~\ref{fig:atrous}.
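For instance, taking a $3 \times 3$ standard kernel ($p=3$) and void rates $\alpha = 1, 2, 3$ (values chosen here only for illustration, not necessarily those used in the experiments), the equivalent kernel sizes are
\begin{equation*}
p ^ { \prime } = 3 + 2 ( \alpha - 1 ) = 3,\ 5,\ 7,
\end{equation*}
while each kernel still has $3 \times 3 = 9$ learnable weights.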
\begin{figure}
  \centering
  \includegraphics[width=0.7\linewidth,page=5]{atrous}
  \caption{Convolution Kernels with Different Void Rates.}\label{fig:atrous}
\end{figure}
Given $C$ kernels of each size and the feature matrix $\pi \left( \mathcal { P } _ { k } \right)$, features extracted by the $j$th ($j=1,2,\ldots,C$) kernel of the $i$th ($i=1,2,3$) size are denoted as
\begin{equation}
\boldsymbol{V} = \mathrm { f } \left( \pi \left( \mathcal { P } _ { k } \right) \operatorname { conv }( \boldsymbol{\omega _ { i } ^ { j }} )+ \boldsymbol{b _ { i }} \right),
\end{equation}
where $\boldsymbol{b _ { i }}$ is the bias and $
\boldsymbol { V } _ { 1 } , \boldsymbol { V } _ { 2 } , \boldsymbol { V } _ { 3 } \in \boldsymbol { R } ^ { C \times 2 m \times n }
$.

In order to reduce noise and highlight key features, the attention module is introduced to adaptively adjust the weights of features from various kernels.
With convolution the score function is modified, denoted as
\begin{equation}
\begin{aligned} \phi ( \boldsymbol { h } , \boldsymbol { r } , \boldsymbol { t } ) = \| & \operatorname { conv } ( \boldsymbol { h } , \boldsymbol { r } ) \circ ( \boldsymbol { h } \otimes \boldsymbol { r } ) \\ & - \operatorname { conv } ( \boldsymbol { r } , \boldsymbol { t } ) \circ \boldsymbol { t } \| \end{aligned}.
\end{equation}

where $\circ$ denotes the Hadamard product and $\otimes$ denotes the Hamilton product. The loss gradient of entity/relation embeddings could propagate bi-directionally through the convolution or hypercomplex multiplications. 

\section{EXPERIMENTS}\label{sec:Experiments}
\subsection{Link Prediction}
Given an entity and a relation, the missing entity is predicted. The higher the correct triples rank in the candidate set, the stronger the prediction capability of the model. Mean Reciprocal Rank (MRR) and the proportion of correct entities / triples in the top $N$ candidates (Hits@$N$, $N=1,3,10$) are selected as metrics; for both, higher is better. The Bernoulli method \citep{wang2014knowledge} is adopted to randomly replace entities to create invalid triples. The filtered setting \citep{bordes2013translating} is employed. Head and tail predictions are regarded as one task and the scores are combined.
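
For concreteness, the two metrics follow the usual definitions, written here in our own notation with $\mathcal{T}$ the set of test queries and $\mathrm{rank}_{q}$ the filtered rank of the correct entity for query $q$:
\begin{gather*}
\mathrm{MRR} = \frac{1}{|\mathcal{T}|} \sum_{q \in \mathcal{T}} \frac{1}{\mathrm{rank}_{q}}, \\
\mathrm{Hits@}N = \frac{1}{|\mathcal{T}|} \sum_{q \in \mathcal{T}} \mathbf{1}\left[ \mathrm{rank}_{q} \le N \right].
\end{gather*}
As a hypothetical example, three queries with ranks $1$, $3$ and $12$ give $\mathrm{MRR} = (1 + 1/3 + 1/12)/3 = 17/36 \approx 0.472$ and $\mathrm{Hits@}10 = 2/3$.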

Experiments are conducted on three benchmark datasets: WN18RR \citep{dettmers2018convolutional}, FB15k-237 \citep{toutanova2015observed} and YAGO3-10 \citep{mahdisoltani2014yago3}. WN18RR and FB15k-237 remove inverse relations to fix the flaw of inflated scores. The relations in the YAGO3-10 dataset are mostly descriptive attributes of people. Some relations have a hierarchical structure, such as \emph{hypernym} (WN18RR), \emph{part-of} (FB15k-237) and \emph{playsFor} (YAGO3-10). Dataset statistics are shown in Table~\ref{tab:1}, in which the degrees reflect the relational complexity of the datasets \citep{dettmers2018convolutional}.

\begin{table}
    \centering
    \caption{Dataset Statistics for Link Prediction.}\label{tab:1}
    \resizebox{1\columnwidth}{!}{
    \begin{tabular}{ccccccc}
      \toprule % from booktabs package
			\bfseries Dataset & \bfseries Entity & \bfseries Relation & \bfseries Degree & \bfseries Train Set  & \bfseries Val. Set  & \bfseries Test Set  \\
      \midrule % from booktabs package
WN18RR           & 40,943          & 11                & $2.2\pm3.6$                         & 86,835             & 3,034             & 3,134             \\
FB15k-237        & 14,541          & 237               & $19.7\pm30$                         & 272,115            & 17,535            & 20,466            \\
YAGO3-10         & 123,182         & 37                & $9.6\pm8.7$                         & 1,079,040          & 5,000             & 5,000             \\
      \bottomrule % from booktabs package
    \end{tabular}}
\end{table}

The following models are used as baselines:
1. TransE: results from \citet{ruffinelli2019you}. 2. RotatE: results from \citet{sun2019rotate}. 3. Rotate3D: results from \citet{gao2020rotate3d}. 4. DistMult, ComplEx and ConvE: results from \citet{dettmers2018convolutional}. 5. ComplEx-N3: results from \citet{lacroix2018canonical}. 6. QuatE: results from \citet{zhang2019quaternion}; we also run our own implementation on YAGO3-10, etc. (code is released at https://gitee.com/tkgc/APAC). 7. ROTE/ATTE: embedding models in hyperbolic spaces, where ATTE combines rotation and reflection while ROTE contains only rotation; results from \citet{chami2020low}. 8. TuckER: a SOTA semantic model based on Tucker decomposition; results from \citet{balavzevic2019tucker}. 9. CoKE: a SOTA path model employing the Transformer to encode semantics; results from \citet{wang2019coke}.

Training details are given in Appendix B. Results, averaged over five runs, are shown in Table~\ref{tab:2}. Results in bold indicate the best performance while those in italics are the second best. $\rm{APAC_q}$ and $\rm{APAC_o}$ denote quaternion and octonion embedding respectively. Overall, APAC compares favorably with mainstream models, and $\rm{APAC_q}$ outperforms $\rm{APAC_o}$ on most indicators.

\begin{table*}
    \centering
    \caption{Link Prediction Results.}\label{tab:2}
\resizebox{2\columnwidth}{!}{
\begin{tabular}{ccccccccccccc}
      \toprule % from booktabs package
\bfseries Model & \multicolumn{4}{c}{\bfseries WN18RR}     & \multicolumn{4}{c}{\bfseries FB15k-237}  & \multicolumn{4}{c}{\bfseries YAGO3-10}   \\
                       & \bfseries MRR   & \bfseries Hits@1 & \bfseries 3     & \bfseries 10    & \bfseries MRR   & \bfseries Hits@1 & \bfseries 3     & \bfseries 10    & \bfseries MRR   & \bfseries Hits@1 & \bfseries 3     & \bfseries 10    \\
      \midrule % from booktabs package
TransE     & 0.228          & -              & -              & 0.520          & 0.313          & -              & -              & 0.497          & -              & -              & -              & -              \\
RotatE     & 0.476          & 0.428          & 0.492          & 0.571          & 0.338          & 0.241          & 0.375          & 0.533          & 0.495          & 0.402          & 0.550          & 0.670          \\
Rotate3D   & \emph{0.489} & 0.442          & \bfseries{0.505} & \bfseries{0.579} & 0.347          & 0.250          & 0.385          & 0.543          & -              & -              & -              & -              \\
DistMult   & 0.430          & 0.390          & 0.440          & 0.490          & 0.241          & 0.155          & 0.263          & 0.419          & 0.340          & 0.240          & 0.380          & 0.540          \\
ComplEx    & 0.440          & 0.410          & 0.460          & 0.510          & 0.247          & 0.158          & 0.275          & 0.428          & 0.360          & 0.260          & 0.400          & 0.550          \\
ComplEx-N3 & 0.470          & -              & -              & 0.540          & 0.350          & -              & -              & 0.540          & 0.490          & -              & -              & \emph{0.680} \\
QuatE      & 0.482          & 0.436          & \emph{0.499} & \emph{0.572} & \emph{0.366} & 0.271          & \emph{0.401} & \bfseries{0.556} & 0.502          & 0.428          & 0.543          & 0.674          \\
ROTE       & 0.463          & 0.426          & 0.477          & 0.529          & 0.307          & 0.220          & 0.337          & 0.482          & 0.381          & 0.295          & 0.417          & 0.548          \\
ATTE       & 0.456          & 0.419          & 0.471          & 0.526          & 0.311          & 0.223          & 0.339          & 0.488          & 0.374          & 0.290          & 0.410          & 0.537          \\
TuckER     & 0.470          & 0.443          & 0.482          & 0.526          & 0.358          & 0.266          & 0.394          & 0.544          & -              & -              & -              & -              \\
CoKE       & 0.484          & \bfseries{0.450} & 0.496          & 0.553          & 0.364          & \emph{0.272} & 0.400          & \emph{0.549} & -              & -              & -              & -              \\
ConvE      & 0.460          & 0.390          & 0.430          & 0.480          & 0.316          & 0.239          & 0.350          & 0.491          & \emph{0.520} & \emph{0.450} & \bfseries{0.560} & 0.660          \\
$\rm{APAC_q}$      & \bfseries{0.501} & \emph{0.447} & 0.487          & 0.535          & \bfseries{0.378} & \bfseries{0.280} & \bfseries{0.407} & 0.548          & 0.518          & \bfseries{0.461} & \emph{0.558} & \bfseries{0.696} \\
$\rm{APAC_o}$      & 0.479          & 0.435          & 0.488          & 0.539          & 0.353          & 0.269          & 0.384          & 0.511          & \bfseries{0.527} & 0.422          & 0.546          & 0.620          \\
      \bottomrule % from booktabs package
    \end{tabular}}
\end{table*}

On WN18RR, the simplest dataset, Rotate3D and QuatE achieve the best results, while $\rm{APAC_q}$ secures the highest MRR and a good Hits@1 score. On the more complex FB15k-237, the advantages of $\rm{APAC_q}$ are clearer: it attains the highest MRR, Hits@1 and Hits@3, its Hits@1 score is 3.3\% higher than that of QuatE, and its Hits@10 score is close behind. On YAGO3-10, the largest dataset, no comparison with Rotate3D is made (no code is available), but $\rm{APAC_q}$ achieves the highest or close to the highest scores on all metrics, with a Hits@1 score 7.7\% higher than that of QuatE and a higher Hits@10 score as well. $\rm{APAC_o}$ performs well on MRR and Hits@3. The comparison with Rotate3D and QuatE suggests that APAC has a strong ability to learn complex relational patterns, which may stem from the integration of path semantics or from the depth-wise atrous circular convolution combined with the attention mechanism. $\rm{APAC_q}$ and $\rm{APAC_o}$ perform better than TransE, DistMult, ComplEx, ROTE, ATTE and TuckER, indicating the effectiveness of hypercomplex embedding. $\rm{APAC_q}$ also holds certain advantages over CoKE, the SOTA context semantics model, supporting the effectiveness of hypercomplex embedding and the feature extraction methods. ConvE is less competitive on WN18RR and FB15k-237, which may be due to its simple reshaping strategy. Follow-up experiments focus on $\rm{APAC_q}$.
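As a check on the gains over QuatE quoted above, the percentages can be recomputed from the Hits@1 entries of the table. This arithmetic is added for illustration and assumes the gains are relative rather than absolute differences:
\begin{equation*}
\frac{0.280-0.271}{0.271}\approx 3.3\%, \qquad \frac{0.461-0.428}{0.428}\approx 7.7\%,
\end{equation*}
for FB15k-237 and YAGO3-10 respectively.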

\subsection{Path Query}
To verify the model's capability for modeling the composition pattern, path query (multi-hop reasoning) experiments are carried out following Rotate3D. Given the starting entity $\boldsymbol{h}$ and the path $\boldsymbol{p}$, the entities that $\boldsymbol{h}$ can reach via $\boldsymbol{p}$ are predicted and ranked. Two datasets provided by \citet{guu2015traversing} are employed, derived from WordNet and Freebase respectively. Dataset statistics are shown in Table~\ref{tab:3}. The same settings as \citet{gao2020rotate3d}, including the negative sampling and filtering strategy, are adopted. The mean quantile (MQ) and Hits@10 are used as metrics; higher is better for both. Two training strategies are employed: training on triples only (denoted Single) and training on all paths (denoted Comp).
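For reference, the two metrics can be written as follows. This is a sketch of a common formulation, assumed here to match the protocol of \citet{guu2015traversing}; the symbols $\mathcal{Q}$ (set of test queries with their correct answers), $\mathcal{N}(q)$ (candidate incorrect answers for query $q$), $s$ (model score) and $\mathrm{rank}$ (filtered rank of the correct answer) are introduced only for illustration:
\begin{align*}
\mathrm{MQ} &= \frac{1}{|\mathcal{Q}|}\sum_{(q,t)\in\mathcal{Q}} \frac{\left|\{t' \in \mathcal{N}(q) : s(q,t') < s(q,t)\}\right|}{|\mathcal{N}(q)|}, \\
\mathrm{Hits@10} &= \frac{1}{|\mathcal{Q}|}\sum_{(q,t)\in\mathcal{Q}} \mathbf{1}\left[\mathrm{rank}(t \mid q) \le 10\right].
\end{align*}
Under this reading, MQ is the average fraction of incorrect candidates ranked below the correct answer, so a value of 1 means that the correct answer always ranks first.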

\begin{table*}
    \centering
    \caption{Dataset Statistics for Path Query.}\label{tab:3}

\begin{tabular}{ccccccccc}
      \toprule % from booktabs package
\bfseries Dataset  & \bfseries Entity & \bfseries Relation & \bfseries Train Set & \bfseries Val. Set & \bfseries Test Set & \bfseries Train Paths & \bfseries Val. Paths & \bfseries Test Paths \\
      \midrule % from booktabs package
WordNet  & 38551  & 11       & 110361    & 2602     & 10462    & 2129539     & 11277      & 56477      \\
Freebase & 75043  & 13       & 316232    & 5908     & 23733    & 6266058     & 27163      & 109557    \\
      \bottomrule % from booktabs package
    \end{tabular}
\end{table*}

The best performance is achieved with embedding dimension $d=500$, batch size 512, learning rate $lr=0.0005$ and margin $\gamma=8$; the other parameters are the same as on YAGO3-10.

APAC is compared with Bilinear \citep{guu2015traversing}, TransE, CoKE, RotatE and Rotate3D under the Single strategy, and with ROP (\citealp{yin2018recurrent}, using an RNN to model paths), CoKE, RotatE and Rotate3D under the Comp strategy. Baseline results are taken from \citet{gao2020rotate3d}, and all results are shown in Tables~\ref{tab:4} and~\ref{tab:5}. Except for the MQ score on WordNet, $\rm{APAC_q}$-Single wins on all metrics in its group. $\rm{APAC_q}$-Comp also achieves the highest or second highest score on each metric and scores higher than $\rm{APAC_q}$-Single, which further indicates that $\rm{APAC_q}$ can learn the composition pattern and integrate path semantics. CoKE relies heavily on contexts, so under the Comp strategy CoKE and $\rm{APAC_q}$ are evenly matched: $\rm{APAC_q}$-Comp leads on WordNet and CoKE-Comp leads on Freebase. Under the Single strategy, however, $\rm{APAC_q}$ performs much better than CoKE.

\begin{table}
    \centering
    \caption{Path Query Results (Triple Training).}\label{tab:4}

\begin{tabular}{ccccc}
      \toprule % from booktabs package
\bfseries{Model}  & \multicolumn{2}{c}{\bfseries{WordNet}}        & \multicolumn{2}{c}{\bfseries{Freebase}}       \\
\bfseries{}       & \bfseries{MQ}          & \bfseries{Hits@10}     & \bfseries{MQ}          & \bfseries{Hits@10}     \\
      \midrule % from booktabs package
Bilinear-Single & 0.847                & 0.436                & 0.580                & 0.259                \\
TransE-Single   & 0.837                & 0.138                & 0.862                & 0.454                \\
CoKE-Single     & 0.731                & 0.157                & 0.730                & 0.367                \\
RotatE-Single   & {\emph{0.937}} & 0.479                & 0.833                & 0.453                \\
Rotate3D-Single & \bfseries{0.941}       & {\emph{0.494}} & {\emph{0.894}} & {\emph{0.547}} \\
$\rm{APAC_q}$-Single    & 0.932                & \bfseries{0.502}       & \bfseries{0.904}       & \bfseries{0.583}   \\
      \bottomrule % from booktabs package
    \end{tabular}
\end{table}

\begin{table}
    \centering
    \caption{Path Query Results (Path Training).}\label{tab:5}

\begin{tabular}{ccccc}
      \toprule % from booktabs package
\bfseries{Model}  & \multicolumn{2}{c}{\bfseries{WordNet}}        & \multicolumn{2}{c}{\bfseries{Freebase}}       \\
\bfseries{}       & \bfseries{MQ}          & \bfseries{Hits@10}     & \bfseries{MQ}          & \bfseries{Hits@10}     \\
      \midrule % from booktabs package
ROP-Comp       & -                & -                & 0.907             & 0.567            \\
CoKE-Comp      & 0.942            & \emph{0.674}   & \bfseries{0.948}    & \bfseries{0.764}   \\
RotatE-Comp    & 0.947            & 0.653            & 0.901             & 0.601            \\
Rotate3D-Comp  & \emph{0.949}   & 0.671            & 0.905             & 0.621            \\
$\rm{APAC_q}$-Comp     & \bfseries{0.960}   & \bfseries{0.719}   & \emph{0.933}    & \emph{0.723}  \\
      \bottomrule % from booktabs package
    \end{tabular}
\end{table}

\subsection{Application on Industry Dataset}
Domain-specific KGs are helpful for promoting knowledge application and industry development. We apply our model to the biomedical dataset ogbl-biokg\footnote{https://ogb.stanford.edu/docs/linkprop/\#ogbl-biokg}, which contains 5 entity types (diseases, drugs, side effects, proteins and protein functions) and 51 relation types. The statistics are shown in Table~\ref{tab:6}. ogbl-biokg is collected from diverse sources, with complex relation patterns and widely varying confidence across facts, which challenges models both in extracting relation features and in modeling knowledge uncertainty (the latter is left for future work). The data are randomly split into train/val./test sets with proportions 94\%, 3\% and 3\% respectively. Since the entity relations are rather dense and replacing only the head or only the tail entity is likely to produce false negatives, the head and tail entities are replaced at the same time to generate invalid samples, with the ratio set to 1:1.
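The sampling scheme described above can be sketched as follows, with notation assumed only for illustration ($\mathcal{E}$ for the entity set and $\mathcal{G}$ for the set of known triples). For each valid triple $(h, r, t)$, one invalid triple $(h', r, t')$ is drawn such that
\begin{equation*}
h' \in \mathcal{E}\setminus\{h\}, \qquad t' \in \mathcal{E}\setminus\{t\}, \qquad (h', r, t') \notin \mathcal{G}.
\end{equation*}
The filtering condition $(h', r, t') \notin \mathcal{G}$ and the absence of entity-type constraints on $h'$ and $t'$ are assumptions of this sketch; the description above only fixes that both entities are replaced and that the ratio is 1:1.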

\begin{table*}
    \centering
    \caption{Dataset Statistics for ogbl-biokg.}\label{tab:6}
\resizebox{2\columnwidth}{!}{
\begin{tabular}{cccccccccc}
      \toprule % from booktabs package
\bfseries{Dataset} & \bfseries{Diseases} & \bfseries{Drugs} & \bfseries{Side Effects} & \bfseries{Proteins} & \bfseries{Functions} & \bfseries{Total Entities} & \bfseries{Train Set} & \bfseries{Val. Set} & \bfseries{Test Set} \\
      \midrule % from booktabs package
ogbl-biokg       & 10687             & 10533          & 9969                  & 17499             & 45085              & 93773                   & 4.76M              & 162k              & 162k              \\
                 & 11.40\%           & 11.23\%        & 10.63\%               & 18.66\%           & 48.08\%            & 100\%                   & 94\%               & 3\%               & 3\%   \\
      \bottomrule % from booktabs package
    \end{tabular}}
\end{table*}

The best performance is achieved with embedding dimensions $d=500$ and $d=1000$, batch size 512 and learning rate $lr=0.0001$; the other parameters are the same as on YAGO3-10.

$\rm{APAC_q}$ is compared with TransE, DistMult, ComplEx, RotatE, QuatE, PairRE \citep{chao2020pairre} and AutoSF+ \citep{zhang2021autosf+}. The latter two are specially designed for modeling complex relations. PairRE introduces relation-specific paired vectors for representation, while AutoSF+ proposes an adaptive score function, pruning the search space with filters and predictors and replacing the greedy algorithm in AutoSF \citep{zhang2020autosf} with evolutionary search.

Results are shown in Table~\ref{tab:7}. $\rm{APAC_q}$ performs best, with only a slight difference between dimensions 500 and 1000, showing strong feature extraction capability at low dimensions. Translation and rotation models, which only extract shallow features, perform less well, and increasing their dimensions does not bring obvious improvement. A similar situation occurs with QuatE. Compared with the semantic models, $\rm{APAC_q}$ holds obvious advantages. Compared with PairRE and AutoSF+, $\rm{APAC_q}$ improves performance without adding embeddings or expanding the search space, which supports the effectiveness of path semantic integration and our convolution framework.
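To make these margins concrete, the differences can be read off Table~\ref{tab:7} directly (arithmetic added for illustration):
\begin{align*}
0.8578 - 0.8320 &= 0.0258 \quad \text{(about } 3.1\% \text{ relative, over the best AutoSF+ result)}, \\
0.8578 - 0.8526 &= 0.0052 \quad \text{($\rm{APAC_q}$ at dimension 1000 versus 500)}.
\end{align*}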

\begin{table}
    \centering
    \caption{Dimensions and MRR on ogbl-biokg.}\label{tab:7}
\begin{tabular}{ccc}
      \toprule % from booktabs package
\bfseries{Model} & \bfseries{Dimension} & \bfseries{MRR}     \\
      \midrule % from booktabs package
TransE         & 2000               & 0.7452            \\
RotatE         & 1000               & 0.7989            \\
DistMult       & 2000               & 0.8043            \\
ComplEx        & 1000               & 0.8095            \\
QuatE          & 500                & 0.7712            \\
QuatE          & 1000               & 0.7954            \\
PairRE         & 2000               & 0.8164            \\
AutoSF+        & 1000               & 0.8309     \\
AutoSF+        & 2000               & 0.8320     \\
$\rm{APAC_q}$          & 500                & \emph{0.8526}   \\
$\rm{APAC_q}$          & 1000               & \bfseries{0.8578}  \\
      \bottomrule % from booktabs package
    \end{tabular}
\end{table}

It is worth noting that, although additional path calculation and convolution operations are introduced, with the phased parallel training strategy it takes only 1.5 hours for $\rm{APAC_q}$ at dimension 500 to surpass QuatE's best performance, which is about 1/5 of the training time of the latter. This indicates that our model improves feature extraction capability while also improving computing efficiency.

\section{CONCLUSION}\label{sec:Conclusion}
A knowledge representation model with stronger expressivity and feature extraction capabilities than embedding models in real or complex spaces is proposed. It is built on hypercomplex embedding, path semantic integration that combines fast quaternion rotation blending with the attention mechanism, and depth-wise atrous circular convolution. The low computational cost of quaternion multiplication, parameter sharing in the CNN and the phased parallel training strategy allow the model to become effective quickly on large datasets. Future work includes further improving the learning of complex composition patterns, applying the Kronecker product to expand embedding spaces efficiently \citep{zhang2021beyond}, and modeling knowledge uncertainty and time validity.

\appendix
% NOTE: necessary when ptmx or no mathfont class option is given
\providecommand{\upGamma}{\Gamma}
\providecommand{\uppi}{\pi}
\section{Proof and Illustration of Theorem 1}
Proof.
For $\boldsymbol{q}_{1} \otimes \boldsymbol{q}_{2}$ and $\boldsymbol{q}_{2} \otimes \boldsymbol{q}_{1}$, bisect the angles respectively, $\boldsymbol{q}_{1}^{\prime} = \boldsymbol{q}_{1}^{\frac{1}{2}}, \boldsymbol{q}_{2}^{\prime} = \boldsymbol{q}_{2}^{\frac{1}{2}}$, resulting in
$\boldsymbol{q}_{1}^{\frac{1}{2}} \otimes \boldsymbol{q}_{2}^{\frac{1}{2}} \otimes \boldsymbol{q}_{1}^{\frac{1}{2}} \otimes \boldsymbol{q}_{2}^{\frac{1}{2}}$
and
$\boldsymbol{q}_{2}^{\frac{1}{2}} \otimes \boldsymbol{q}_{1}^{\frac{1}{2}} \otimes \boldsymbol{q}_{2}^{\frac{1}{2}} \otimes \boldsymbol{q}_{1}^{\frac{1}{2}}$.
Parts of the two calculations are the same. Further $k$-sect $\boldsymbol{q}_{1}, \boldsymbol{q}_{2}$. The greater the value of $k$, the higher the proportion of identical calculations. When $k \rightarrow \infty$, the middle parts of the two calculations converge, while the head and tail factors approach the identity quaternion and their influence weakens, leading to stable results. According to \citet{alexa2002linear}, the limit exists, so we have
\begin{equation}
\lim_{k \rightarrow \infty} \left( \boldsymbol{q}_{1}^{\frac{1}{k}} \otimes \boldsymbol{q}_{2}^{\frac{1}{k}} \right)^{k} = \lim_{k \rightarrow \infty} \left( \boldsymbol{q}_{2}^{\frac{1}{k}} \otimes \boldsymbol{q}_{1}^{\frac{1}{k}} \right)^{k}.
\end{equation}
For the Trotter product formula
\begin{equation}
e^{A + B} = \lim_{N \rightarrow \infty} \left( e^{\frac{A}{N}} * e^{\frac{B}{N}} \right)^{N},
\end{equation}
replace $e^{\frac{A}{N}}, e^{\frac{B}{N}}$ with $\boldsymbol{q}_{1}^{\frac{1}{k}}, \boldsymbol{q}_{2}^{\frac{1}{k}}$. With $e^{\log \boldsymbol{q}} = \boldsymbol{q}$, we have
\begin{equation}
\lim_{k \rightarrow \infty} \left( \boldsymbol{q}_{1}^{\frac{1}{k}} \otimes \boldsymbol{q}_{2}^{\frac{1}{k}} \right)^{k} = e^{\log \boldsymbol{q}_{1} + \log \boldsymbol{q}_{2}}.
\end{equation}

(The formal proof of the isomorphism between quaternions and matrices is left for future work.) It can be seen that the limit operation is equivalent to summing the logarithms of the two quaternions and exponentiating the result, so the calculation cost is constant. Extending this operation to a quaternion sequence gives (4).
When the special case of coaxial rotational blending (on the same plane) occurs, the result could be seen as a quaternion formed by adding all rotational angles together. 
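The coaxial case can be worked out explicitly. The following sketch assumes the standard half-angle form of unit rotation quaternions, with $\boldsymbol{u}$ denoting the shared unit axis (a pure unit quaternion) and $\theta_{1}, \theta_{2}$ the rotation angles; these symbols are introduced here for illustration. Writing $\boldsymbol{q}_{i} = \cos\frac{\theta_{i}}{2} + \boldsymbol{u}\sin\frac{\theta_{i}}{2} = e^{\frac{\theta_{i}}{2}\boldsymbol{u}}$, we have $\log \boldsymbol{q}_{i} = \frac{\theta_{i}}{2}\boldsymbol{u}$ and therefore
\begin{equation*}
e^{\log \boldsymbol{q}_{1} + \log \boldsymbol{q}_{2}} = e^{\frac{\theta_{1}+\theta_{2}}{2}\boldsymbol{u}} = \cos\frac{\theta_{1}+\theta_{2}}{2} + \boldsymbol{u}\sin\frac{\theta_{1}+\theta_{2}}{2},
\end{equation*}
which is the rotation by $\theta_{1}+\theta_{2}$ about $\boldsymbol{u}$. In this case the result coincides with the ordinary product $\boldsymbol{q}_{1} \otimes \boldsymbol{q}_{2}$, since quaternions sharing an axis commute.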

Illustration.
Rotations represented by quaternions $\boldsymbol{q}_{1}, \boldsymbol{q}_{2}$ under different $k$-sections are shown in Figure~\ref{fig:rotation}.
\begin{figure}
  \centering
  \includegraphics[width=1\linewidth,page=10]{rotation}
  \caption{Rotations under Different $k$-sections.}\label{fig:rotation}
\end{figure}
The axes in 3D space are shown in red, blue and green respectively, and the black arrow is the vector calculated from the axis and the angle. It can be seen that when $k=8$, the result is close to the result when $k \rightarrow \infty$.

\section{Model Training Details}
Experiments are conducted on a Lenovo SR590 server with two 20-core Xeon CPUs, $8 \times 16$\,GB of memory, three 1.2\,TB SAS disks (in RAID5 mode) and two Tesla P100 computing cards.

The Adam optimizer (adaptive estimates of lower-order moments) is adopted, and optimal parameters are determined with grid search. The hyperparameter pool and optimal parameters can be found on the project homepage. Dropouts are added before the convolution, after the convolution and after the fully connected layer, numbered 1, 2 and 3 respectively. Following \citet{gao2020rotate3d}, relation-specific biases are applied. Batch normalization is employed to reduce the scaling effect caused by hypercomplex multiplications and to control the normalizing rate. It is found that with batch normalization the models converge faster and perform more stably than with unit quaternion normalization. Early stopping is activated when the MRR increase over the last 10 epochs on the validation set is less than $10^{-2}$. In the other experiments the optimizer and training strategies are the same unless stated otherwise.
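One reading of the early-stopping rule can be stated as a formula. This is an illustrative interpretation, with $\mathrm{MRR}_{\mathrm{val}}(e)$ denoting the validation MRR after epoch $e$ (a symbol introduced here): training stops at the first epoch $e \ge 10$ for which
\begin{equation*}
\mathrm{MRR}_{\mathrm{val}}(e) - \mathrm{MRR}_{\mathrm{val}}(e-10) < 10^{-2}.
\end{equation*}
An alternative reading compares the best validation MRR inside the 10-epoch window with the best value before it; the description above does not distinguish between the two.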

\begin{contributions} % will be removed in pdf for initial submission,
                      % so you can already fill it to test with the
                      % ‘accepted’ class option

Ms. Gao and Ms. Shi helped with the theoretical deduction and verification of the combination of the atrous and circular convolution. Ms. Zhou and Mr. Husen oversaw the overall framework and project implementation.

\end{contributions}

\begin{acknowledgements} % will be removed in pdf for initial submission,
                         % so you can already fill it to test with the
                         % ‘accepted’ class option

The study is supported by the provincial research project in Fujian (FJJKCG20-402).
\end{acknowledgements}

\bibliography{chen_271}

\end{document}
