\documentclass{article} % For LaTeX2e
\usepackage{iclr2024_conference,times}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx}
\usepackage{float}
\usepackage{algorithm}  
\usepackage{algorithmicx}
\usepackage{algpseudocode}
\usepackage{natbib}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{color}


\DeclareCaptionFormat{myformat}{\raggedright#1#2#3\par}
\captionsetup[algorithm]{format=myformat}


\algnotext{EndFor}
\algnotext{EndIf}
\algnotext{EndProcedure}

\title{OrthCaps: An Orthogonal CapsNet with \\Sparse Attention Routing and Pruning} %and Capsule ReLU}

% Authors must not appear in the submitted version. They should be hidden
% as long as the \iclrfinalcopy macro remains commented out below.
% Non-anonymous submissions will be rejected without review.

\author{Xinyu Geng \\
Department of Mechanical Engineering and Automation \\
Harbin Institute of Technology\\
Shenzhen, Shenzhen 518055, China \\
\texttt{22S153095@stu.hit.edu.cn} \\
\AND
Jun Xu \\
Department of Mechanical Engineering and Automation \\
Harbin Institute of Technology\\
Shenzhen, Shenzhen 518055, China \\
\texttt{email}
}

% The \author macro works with any number of authors. There are two commands
% used to separate the names and addresses of multiple authors: \And and \AND.
%
% Using \And between authors leaves it to \LaTeX{} to determine where to break
% the lines. Using \AND forces a linebreak at that point. So, if \LaTeX{}
% puts 3 of 4 author names on the first line, and the last on the second
% line, try using \AND instead of \And before the third author name.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

%\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.
\begin{document}
\hyphenpenalty=2000


\maketitle

\begin{abstract} 
   Redundancy is a persistent challenge in Capsule Networks (CapsNet),
   leading to high computational costs and parameter counts \citep{jeong2019ladder,sharifi2021prunedcaps,renzulli2022towards}. 
   Although previous works have introduced pruning after the initial capsule layer,
   dynamic routing's iterative and fully connected nature reintroduces inefficiencies and redundancy in deeper layers. 
   % Considering that orthogonality has been applied in Convolutional Neural Networks (CNNs) to minimize filter similarity,
   % we introduce orthogonality to improve the routing algorithm.
   In this paper, 
   we propose the Orthogonal Capsule Network (OrthCaps) to reduce redundancy, improve routing performance and decrease parameter count.
   Specifically, an efficient pruned capsule layer is placed to discard redundant capsules and dynamic routing is replaced with orthogonal sparse attention routing. 
   In addition, inspired by the use of orthogonality in Convolutional Neural Networks (CNNs),
   we orthogonalize the weight matrices during routing to preserve feature diversity and keep capsule similarity low.
   Moreover, a novel activation function named Capsule ReLU is proposed to address the vanishing gradient problem.
   Experiments on baseline datasets confirm the efficiency and robustness of OrthCaps in classification tasks, 
   and ablation studies validate the contribution of each component. 
   Remarkably, with only 110k parameters, merely 1.25\% of a standard Capsule Network's total, OrthCaps-Shallow outperforms state-of-the-art (SOTA) benchmarks on four datasets,
   while OrthCaps-Deep attains nearly SOTA accuracy with 1.2\% of its parameters on four datasets. 
   The code is available at \href{https://github.com/ornamentt/Orthogonal-Capsnet}{ornamentt/Orthogonal-Capsnet (github.com)}.
   %Dynamic routing in Capsule Networks requires iterations until convergence and is essentially a fully connected structure,
   %inducing computational inefficiency and redundancy.
   %increasing parameters in shallow architecture and restricting the development of deep architecture.  
\end{abstract}

\vspace*{-15pt}

\section{Introduction}\label{1}

CapsNet replaces neurons with capsule vectors, 
where the length of a capsule denotes the existence probability of an entity in the image
and its direction encodes the captured features~\citep{sabour2017dynamic}.
Thus, high similarity between the directions of two capsules implies that they extract analogous features. 
Recent studies have reported that CapsNet contains redundant capsules \citep{chen2022fast,sharifi2021prunedcaps,renzulli2022towards}.
As evidence, Figure (\ref{similarities}) shows that 48.2\% of primary capsule pairs exhibit cosine similarities above 0.65, indicating significant redundancy.

While some works have employed pruning techniques at the primary capsule layer~\citep{renzulli2022rem}, 
deeper layers continue to exhibit high capsule similarity, as illustrated in Figure (\ref{pruned}). 
This deep redundancy is primarily a result of dynamic routing. 
In this mechanism, every lower-level capsule connects to all higher-level ones,
and this fully connected structure can transmit redundant information. 
Furthermore, the weight matrices used in routing can shift capsule directions, 
raising the capsule similarity that pruning had reduced and causing feature overlap.
Such overlap not only impairs routing performance but also means that, despite the initial pruning,
new redundancies emerge in subsequent layers.
%Applying pruned layers after each dynamic routing would surge parameter counts and computational costs. 
Additionally, dynamic routing runs multiple iterations, repeatedly updating the coupling coefficients until convergence, 
which further strains computational resources.

Considering these challenges, and inspired by the introduction of orthogonality in CNNs to reduce filter overlaps, 
we integrate orthogonality into capsule networks. 
Our proposed solution, the Orthogonal Capsule Network (OrthCaps), 
addresses the iterative convergence, directional shifts in capsule vectors, and the fully connected structure of dynamic routing. 
We present two versions of OrthCaps: a lightweight model, OrthCaps-Shallow \textbf{(OrthCaps-S)}, and an efficient deep model, OrthCaps-Deep \textbf{(OrthCaps-D)}.



\begin{figure}[h]
   \centering
      \begin{minipage}{0.4\textwidth} 
         \vspace*{-10pt}
         Firstly, we introduce a pruned capsule layer following the primary capsule layer.
         This layer eliminates redundant capsules, retaining only the essential and representative ones. 
         The importance of a capsule is gauged by its $L_2$-norm, which reflects the existence probability of the entity it represents. 
         Given that a capsule's direction signifies the features it extracts,
         we measure the correlation between capsules using cosine similarity 
         and employ broadcasting and matrix multiplication for algorithmic efficiency.
      \end{minipage}%
      \hspace{0.01\textwidth}
      \begin{minipage}{0.58\textwidth}
         \vspace*{-13pt}
         \centering
         \includegraphics[scale=0.22]{./figs/concatenated_image.png}
         \caption{\textbf{Left:} In the primary capsule layer of CapsNet, 48.2\% of capsule pairs have cosine similarities greater than 0.65, indicating significant redundancy among capsules.
         \textbf{Right:} After the pruned capsule layer is introduced, capsule similarities decrease
         (detailed in Section 3.2).}
         \label{similarities}
      \end{minipage}
\vspace*{-18pt}
\end{figure}


Secondly, dynamic routing is replaced with attention routing, eliminating the need for iteration.
To address the fully connected structure, we leverage sparsemax-based self-attention to produce an attention map  
that selectively amplifies relevant feature groups corresponding to specific entities while downplaying irrelevant ones.
For OrthCaps-S, a simplified attention-routing mechanism is adopted to reduce parameter count and computational cost.

Thirdly, to address the issue of capsule vector direction shifts, we introduce orthogonality into capsule networks.
An orthogonal weight matrix preserves the lengths of capsule vectors and the angles between them, 
thus mitigating feature interference.
Utilizing Householder orthogonal decomposition, 
we enforce orthogonality in the weight matrices during attention routing,
which sustains low inter-capsule correlation and enriches feature diversity.




Lastly, we propose an activation function called Capsule ReLU, tailored for deep capsule networks. 
Although squash prevails in capsule networks, its saturation regions, akin to the sigmoid function, 
lead to vanishing gradient problems in deeper architectures.
Thus, Capsule ReLU is designed to better suit OrthCaps-D. 

\textbf{Contributions.}
To summarize our work, we make the following contributions:

1) A novel orthogonal sparse attention routing mechanism is proposed to replace dynamic routing.
To the best of our knowledge, this is the first time orthogonality has been introduced into capsule networks. 
This simple, penalty-free orthogonalization method is also adaptable to other neural network architectures.

2) A pruned capsule layer is introduced to alleviate capsule redundancy, 
and a new activation function named Capsule ReLU is proposed for deep capsule networks.

3) Two versions of OrthCaps are developed: OrthCaps-S and OrthCaps-D.
OrthCaps-S sets a new benchmark in accuracy with just 1.25\% of CapsNet's parameters on MNIST, SVHN, smallNORB, and CIFAR10.
OrthCaps-D excels on CIFAR10, CIFAR100 and FashionMNIST while keeping the parameter count minimal.

\section{Related Work}

\textbf{Capsule Neural Networks.}
Dynamic routing was first introduced by \citet{sabour2017dynamic}.
%and various mechanisms have been developed to improve the performance of CapsNet, 
%such as Expectation Maximization routing \citep{hinton2018matrix}, Self-routing \citep{hahn2019self}, and Variational Bayesian routing \citep{ribeiro2020capsule}. 
Although numerous studies have leveraged attention strategies \citep{hoogi2019self,peng2020bg,mazzia2021efficient} to refine dynamic routing, 
the fully connected structure of the original design~\citep{sabour2017dynamic} and the redundancy it introduces have seldom been addressed. 
\citet{choi2019attention} incorporated attention into capsule routing via a non-iterative feed-forward operation.
\citet{tsai2020capsules} introduced parallel iterative routing, which still relies on iteration and therefore does not remove its computational cost. 
Furthermore, \citet{jeong2019ladder,sharifi2021prunedcaps,renzulli2022rem} incorporated pruning, but did not account for the new redundancies introduced by dynamic routing. 
\citet{jeong2019ladder} introduced a ladder structure to CapsNet, using a pruning algorithm based on code vectors.
\citet{sharifi2021prunedcaps} created a pruning layer based on Taylor Decomposition.
\citet{renzulli2022rem} utilized LOBSTER to create a sparse parse tree.
Unlike existing research, this paper combines pruning, orthogonality and sparsity to eliminate redundancy.
% In contrast to the heavy parameterization in DeepCaps \citep{rajasegaran2019deepcaps}, 
% Our approach minimizes parameter count while exploring the potential for deep multi-layer capsule networks.

\textbf{Orthogonality.}
Various methods have been proposed to introduce orthogonality into neural networks, 
and they can be categorized into hard and soft orthogonality. 
Hard orthogonality maintains matrix orthogonality throughout training by either optimizing over the Stiefel manifold \citep{li2020efficient,huang2018orthogonal} 
or parameterizing a subset of orthogonal matrices \citep{trockman2021orthogonalizing,singla2021skew,virmaux2018lipschitz}.
These methods incur computational overhead and can suffer from vanishing or exploding gradients.
Soft orthogonality, on the other hand, employs a regularization term in the loss function 
to encourage orthogonality among the column vectors of the weight matrix without strict enforcement \citep{wang2020orthogonal,qi2020deep,huang2020controllable}. 
Yet strong regularization can overshadow the primary task loss, while weak regularization fails to encourage orthogonality effectively. 
We leverage Householder orthogonal decomposition \citep{uhlig2001constructive,mathiasen2020if} to achieve strict matrix orthogonality, 
minimizing computational complexity and obviating the need for additional regularization terms.

\section{Methodology}

\subsection{Overall Architecture}
% To fully exploit the capabilities of capsule networks with minimal capsules, extract as many features as possible, and prevent the vanishing gradient problem caused by squashing functions, 
% We introduce OrthCaps, including both deep (OrthCaps-D) and shallow (OrthCaps-S) variants.
We introduce OrthCaps, offering both deep (OrthCaps-D) and shallow (OrthCaps-S) architectures 
to minimize parameter count while exploring the potential for deep multi-layer capsule networks. 
As illustrated in Figure (\ref{orth_structure})(a),
OrthCaps-D comprises five key components: a convolutional layer, a primary capsule layer, a pruned capsule layer, capsule blocks and a flat capsule layer.
\begin{figure}[h]
  \centering
  \includegraphics[width=1\textwidth]{./figs/orth_structure.png}
  \caption{\textbf{(a):} For the CIFAR10 classification task, the OrthCaps-D model comprises 7 capsule blocks, 
  each with 3 capsule layers, interconnected via shortcut connections and orthogonal sparse attention routing.
           \textbf{(b):} The OrthCaps-S model uses two capsule layers for CIFAR10 and none for MNIST. 
           These layers are linked through simplified attention routing.}
  \label{orth_structure}
\end{figure}

\vspace{-13pt}
Given an input image \( x \in \mathbb{R}^{H \times W \times 3} \), 
low-level features \( \Phi^l \in \mathbb{R}^{B \times C \times W^l \times H^l} \) are extracted through four convolutional layers, where $B$ is the batch size. 
The primary capsule layer generates initial capsules \( u^l \in \mathbb{R}^{B \times n \times d \times W^l \times H^l} \) 
with a kernel size of 3 and a stride of 2.  
A pruned capsule layer is then applied to remove redundant capsules.
Each capsule block contains three convolutional capsule layers with depthwise convolutions, and shortcut connections mitigate vanishing gradients.
In capsule blocks, lower-level capsules \( u^l \) are routed to the next layer \( v^{l+1} \) via orthogonal sparse attention routing and Capsule ReLU.
The block structure permits stacking to construct deeper capsule networks. 
The flat capsule layer comprises depthwise convolutional layers with a $3 \times 3$ kernel and a stride of 2 for capsule map reduction, and $1 \times 1$ pointwise convolutions with a stride of 1 for dimensionality mapping.

OrthCaps-S, as illustrated in Figure (\ref{orth_structure})(b), 
replaces the complete attention routing with a simplified version, 
retaining a single cell with adjustable convolutional capsule layers. 
Convolutional capsules in the primary layer utilize a $9 \times 9$ kernel with a stride of 1.
Other layers and the activation function are consistent with OrthCaps-D.


\subsection{Pruned Capsule Layer}


% The primary capsule layer is the initial step in capsule generation, 
% creating capsules from the convolutional layer's feature maps. 
% Low-correlated capsules at this stage not only yield more efficient feature representation 
% but also minimize feature interference during downstream processing, 
% thereby facilitating more effective capsule clustering during routing. 
% Therefore, to get low-correlated capsules, 
% we introduce a capsule pruning algorithm immediately following the primary capsule layer.

The generation of capsules starts with the primary capsule layer,  
which derives its input from feature maps of preceding convolutional layers. 
Reducing redundancy at this stage is crucial to ensure low-correlated capsules, 
allowing for efficient feature representation.  
% This not only ensures efficient feature representation 
% but also minimizes feature interference during subsequent operations, 
% thus enhancing capsule clustering during routing. 
To achieve this, we introduce an efficient capsule pruning algorithm after the primary capsule layer.
Algorithm \ref{algo:efficient_pruning} comprises the following steps:

\textbf{Capsule Sorting:}
To ensure that the less important capsule is discarded when the similarity between a pair of capsules is high,
each capsule $u_{l,i}$ is ranked by its $L_2$-norm \( \| u_{l,i} \|_2 \). 
This norm indicates the existence probability of the entity extracted by $u_{l,i}$
and therefore serves as a measure of the capsule's importance.
The sorted capsules are stored in the tensor $u_{\text{sorted}}$. To simplify computation, we reshape the 5D tensor $u_l$ from
\( [B, \text{num\_capsules}, \text{dim\_capsules}, W, H] \)
to
\( [B, \text{num\_capsules}, \text{dim\_capsules} \times W \times H] \). 

\begin{minipage}{0.52\textwidth} %
   \textbf{Redundancy Definition:}
   The direction of each capsule vector represents specific features, 
   so capsules with close directions represent similar features and entities. 
   Therefore, we use the cosine similarity of capsule directions to measure their correlation and redundancy.
   The similarity matrix \( S \) for \( u_{\text{sorted}} \) is computed using broadcasting.

   \vspace*{5pt}
   \textbf{Pruning:}
   Any capsule pair whose similarity exceeds the threshold \( \theta = 0.7 \) is considered oversimilar. 
   The capsule of the pair that ranks lower in $u_{\text{sorted}}$ is then deemed redundant 
   and deactivated by multiplying with a mask matrix \( M \).
   %The algorithmic details are outlined in Algorithm \ref{algo:efficient_pruning}.
\end{minipage}%
\hspace{0.02\textwidth}
\begin{minipage}{0.46\textwidth}
   % \begin{algorithm}
      % \centering
      \hrule height 0.8pt
      \vspace*{2pt}
      \captionof{algorithm}{Efficient Capsule Pruning}
      \vspace*{-9pt}
      \hrule height 0.4pt
      \begin{algorithmic}[1]
         \Require $u \in \mathbb{R}^{B \times n \times d \times W \times H}$, $\theta$
         \Ensure $u_{\text{pruned}} \in \mathbb{R}^{B \times n \times d \times W \times H}$
         \State Reshape $u$ $\rightarrow$ $u_{\text{flat}} \in \mathbb{R}^{B \times n \times (d \times W \times H)}$
         \State Compute $L_2$-norm: $\| u_{\text{flat}} \|_2$
         \State Sort capsules by $L_2$-norm: $u_{\text{sorted}}$
         
         \State Flatten $u_{\text{sorted}}$ to $u_{\text{flat}} \in \mathbb{R}^{B \times n \times (d \times W \times H)}$
         \State $S = \text{cosine\_similarity}(u_{\text{flat},i}, u_{\text{flat},j})$
         
         \State Create mask $M$ that zeroes the lower-ranked capsule of each pair with $S > \theta$
         \State Prune using $M$: $u_{\text{pruned}} = u_{\text{sorted}} \odot M$
         
         \State \Return $u_{\text{pruned}}$
      \end{algorithmic}
      \label{algo:efficient_pruning}
      \hrule height 0.4pt
   % \end{algorithm}
\end{minipage}
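
As a small worked illustration of the three steps above (a toy sketch with made-up numbers, not taken from the experiments), consider $n = 3$ flattened capsules in $\mathbb{R}^2$, already sorted by $L_2$-norm in descending order, and assume that the mask removes the lower-ranked capsule of every oversimilar pair:
\begin{equation*}
   u_1 = (3, 0), \quad u_2 = (2, 0.5), \quad u_3 = (0, 1), \qquad
   \|u_1\|_2 = 3, \quad \|u_2\|_2 \approx 2.06, \quad \|u_3\|_2 = 1 .
\end{equation*}
Stacking the normalized capsules $\hat{u}_i = u_i / \|u_i\|_2$ as the rows of a matrix $\hat{U}$ yields all pairwise cosine similarities with a single matrix product:
\begin{equation*}
   S = \hat{U}\hat{U}^T \approx
   \begin{bmatrix}
      1 & 0.97 & 0 \\
      0.97 & 1 & 0.24 \\
      0 & 0.24 & 1
   \end{bmatrix} .
\end{equation*}
With $\theta = 0.7$, only the pair $(u_1, u_2)$ is oversimilar, so the lower-ranked $u_2$ is masked: $M = (1, 0, 1)$, and the pruned set is $\{u_1, 0, u_3\}$.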


\subsection{Orthogonal Sparse Attention Routing}

We introduce orthogonal sparse attention routing to replace dynamic routing,
which enables non-iterative and less redundant feature transmission from lower-level to higher-level capsules.

\subsubsection{Routing Algorithm}

Let \( u_{l,i} \) and \( v_{l+1,j} \) represent capsules at layer \( l \) and \( l+1 \) respectively,
each with dimension \( d \). 
We employ three weight matrices \( W_Q \), \( W_K \), \( W_V \in \mathbb{R}^{d \times d} \)
to derive queries, keys, and values from \( u_{l,i} \): 
$Q = W_Q \times u_{l,i}$, $K = W_K \times u_{l,i}$, $V = W_V \times u_{l,i}$.
Specifically, \( W_Q \), \( W_K \), and \( W_V \) are designed as orthogonal matrices,
enabling them to project capsule \( u_{l, i} \) into a \( d \)-dimensional orthogonal subspace. 
% Orthogonality ensures that during the routing process, information from different capsules doesn't interfere with each other, 
% thus enhancing the network's representational capability.

\begin{figure}[h]
   \begin{center}
   \includegraphics[width=1  \textwidth]{./figs/attention_routing.png}
   % \fbox{\rule[-.5cm]{0cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
   \end{center}
   \caption{Orthogonal self-attention routing.}
   \label{self_attention}
\end{figure}

As shown in Figure (\ref{self_attention}), attention routing aims to produce coupling coefficients \( c_{ij} \), 
which quantify the information transmitted from a lower-level capsule $i$ to a higher-level capsule $j$.
The coefficients \( c_{ij} \) are the entries of the attention map $C$, generated from the scaled dot product of queries and keys, 
$C = \text{Entmax}_{(\alpha)}(QK^T/\sqrt{d})$.
Here, we replace the softmax function in the original attention mechanism with the \(\alpha\)-Entmax function \citep{peters2019sparse}
to enhance the sparsity of the attention map, thereby encouraging routing to prioritize more important capsules while minimizing irrelevant information transfer.
The vote $s_{i,j}$ is computed as the product of $C$ and \( V \). 
Higher-level capsules \( v_{l+1,j} \) are obtained by applying the nonlinear activation function $g$ to the votes $s_{i,j}$, which are computed with a multi-head self-attention mechanism with 16 heads:
% \vspace*{-5pt}
\begin{equation}
   v_{l+1,j} = g(s_{i,j}) = g(\text{Entmax}_{(\alpha)}(QK^T/\sqrt{d}) \times V)
   \label{S_{i,j}}
\end{equation}
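To see why replacing softmax increases sparsity, the following sketch evaluates the special case $\alpha = 2$, for which $\alpha$-Entmax coincides with sparsemax \citep{peters2019sparse}. The value of $\alpha$ and the score vector $z$ are illustrative assumptions, not settings reported in this paper. In general,
\begin{equation*}
   \text{Entmax}_{(\alpha)}(z)_i = \left[ (\alpha - 1) z_i - \tau \right]_+^{1/(\alpha - 1)},
\end{equation*}
where $[\cdot]_+ = \max(\cdot, 0)$ and $\tau$ is the threshold that makes the outputs sum to one. For $\alpha = 2$ and $z = (1.5, 1, 0)$, the threshold is $\tau = 0.75$, so that
\begin{equation*}
   \text{Entmax}_{(2)}(z) = (0.75, 0.25, 0), \qquad \text{softmax}(z) \approx (0.55, 0.33, 0.12) .
\end{equation*}
The lowest-scoring entry receives exactly zero weight under sparsemax, whereas softmax still routes about 12\% of the mass to it.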
% \( c_{ij} \) can be viewed as a vote or prediction for the attributes of the higher-level capsule. % Here, the value \( V \) can be understood as the prediction \( \hat{u} \) of the lower-level capsule \( u \) 
% for the higher-level capsule in dynamic routing.

% Compared with dynamic routing, attention routing does not require \( T \) iterations, greatly reducing computational complexity.

For simplified attention routing in Figure (\ref{self_attention}), we condense the prediction matrices \( W \) from three to one
and replace \( K, Q, V \) with \( u_{l, i} \). 
\( \hat u_{l,i} \) is the prediction for \( v_{l+1,j} \). %$\hat u_{l,i} = Conv3D(u_{l,i},W)$.
The attention map $C$ is obtained by applying \( \alpha \)-Entmax to the scaled dot product, and the vote is 
$s_{i,j} = u_{l, i} \times C = u_{l, i} \times \text{Entmax}_{(\alpha)}(\hat u_{l, i} u_{l, i}^T/\sqrt{d})$. 
\( s_{i,j} \) is concatenated with \( \hat u_{l,i} \) and then processed through \( g \) to produce \( v_{l+1,j} \).
Notably, standard convolutions are replaced by depthwise convolutions to minimize parameter count.
%Compared with dynamic routing, 
Because no iteration is required, attention routing reduces computational complexity.

\subsubsection{Orthogonalization of Weight Matrix}
 
The pruned capsule layer diminishes capsule similarity to reduce redundancy. %thereby reducing mutual feature interference during routing and improving clustering performance. 
As analyzed in Section \ref{1}, an unconstrained weight matrix modifies the directions of capsule vectors, 
potentially destroying the low correlation among the capsules in $K, Q, V$ 
and thus introducing new redundancy in subsequent layers.
An orthogonal transformation preserves the lengths of vectors and the angles between them, 
thus preserving the low correlation and preventing feature overlap,  
which improves the performance of attention routing and pruning. %by mitigating feature interference across different capsules.
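
The effect of orthogonality on capsule similarity can be verified directly. Assuming $W^T W = I$, for any two capsules $u_i$ and $u_j$,
\begin{equation*}
   \langle W u_i, W u_j \rangle = u_i^T W^T W u_j = u_i^T u_j, \qquad \| W u_i \|_2 = \| u_i \|_2 ,
\end{equation*}
so the cosine similarity between $W u_i$ and $W u_j$ equals that between $u_i$ and $u_j$. As a contrasting toy example (the numbers are illustrative only), take the non-orthogonal matrix $W' = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$ and the uncorrelated capsules $u_1 = (1, 0)^T$, $u_2 = (0, 1)^T$. Then $W' u_1 = (1, 0)^T$ and $W' u_2 = (1, 1)^T$, whose cosine similarity is $1/\sqrt{2} \approx 0.71$: a pair that started orthogonal ends up above the pruning threshold $\theta = 0.7$.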

\begin{figure}[h]
   \begin{center}
   \includegraphics[width=0.8\textwidth]{./figs/Householder.png}
   % \fbox{\rule[-.5cm]{0cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
   \end{center}
   \caption{Computation process of the Householder orthogonalization method.}
   \label{Householder}
\end{figure}

\vspace*{-5pt}
Let \( W \) be the weight matrix requiring orthogonalization. 
% We seek an orthogonal matrix \( W \) satisfying \( W^T W = I \), 
% where \( I \) is the identity matrix.
As shown in Figure (\ref{Householder}), the Householder orthogonal decomposition theorem is employed to construct an orthogonal matrix that can be optimized directly by gradient descent. 
This approach rests on the following algebraic lemma \citep{uhlig2001constructive}:

\textbf{Lemma 1}: Any orthogonal \(n \times n\) matrix is the product 
of at most \(n\) orthogonal Householder transformations.

Let \(d\) represent the dimension of the capsule. 
Based on Lemma 1, an orthogonal matrix \(W \in \mathbb{R}^{d \times d}\) can be formulated as in Equation (\ref{W1}):

\begin{equation}
   W = H_0 H_1 \dots H_{d-1}
   \label{W1}
\end{equation}

Each \(H_i\) represents a Householder transformation, defined as \(H_i = I - 2a_i a_i^T\), 
where \(a_i\) is a unit column vector. 
Instead of constraining \(a_i\) directly, we use a set of randomly initialized column vectors 
\(\{b_i \mid i = 0, \dots, d - 1\}\) and normalize them inside \(H_i\), i.e. \(a_i = b_i / \|b_i\|\), as detailed in Equation (\ref{W2}).
During training, \(b_i\) is optimized through gradient backpropagation.
% Since Lemma 1 serves as a necessary and sufficient condition(Proof is provided in the Appendix A.1), 
Because every factor is a Householder transformation for any nonzero \(b_i\), \(W\) inherently preserves its orthogonality during training.

\begin{equation}
   W = \prod_{i=0}^{d-1} \left( I - \frac{2 b_i b_i^T}{{\|b_i\|}^2} \right)
   \label{W2}
\end{equation}

\textbf{Lemma 2}: \( W_Q \), \( W_K \), and \( W_V \) constructed using Equation (\ref{W2}) are orthogonal.

Following Equation (\ref{W2}), 
\( W_Q \), \( W_K \), and \( W_V \) are orthogonal by construction. %by Householder orthogonal decomposition.
The proof is provided in Appendix \ref{A.3.1}.
The proposed orthogonalization method for weight matrices is generalizable to any neural network 
and is not limited to capsule networks. 
Householder orthogonalization enables computationally efficient transformation of 
arbitrary coefficient matrices into orthogonal matrices without any additional penalty terms in the loss function.
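
As a concrete check of Equation (\ref{W2}), take $d = 2$ with the hypothetical vectors $b_0 = (1, 1)^T$ and $b_1 = (1, 0)^T$, chosen only for illustration:
\begin{equation*}
   H_0 = I - \frac{2 b_0 b_0^T}{\|b_0\|^2} = \begin{bmatrix} 0 & -1 \\ -1 & 0 \end{bmatrix}, \qquad
   H_1 = I - \frac{2 b_1 b_1^T}{\|b_1\|^2} = \begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix},
\end{equation*}
\begin{equation*}
   W = H_0 H_1 = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \qquad
   W^T W = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} = I .
\end{equation*}
Each factor is a reflection, and their product here is a rotation by $90^\circ$. Rescaling any $b_i$ leaves $H_i$ unchanged because of the normalization by $\|b_i\|^2$, so a gradient step on $b_i$ can change which orthogonal matrix $W$ is, but not whether it is orthogonal.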
\vspace*{-5pt}
\subsection{Capsule ReLU}

The activation function is an indispensable part of routing. 
However, as shown in Figure (\ref{squash}) in Appendix \ref{A.4.2}, squash has saturation regions similar to those of the sigmoid function,
which may result in vanishing gradients during backpropagation. %and impair deep capsule network architectures.
Therefore, we combine the capsule structure with ReLU, which has no saturation region for positive inputs and thus avoids vanishing gradients.
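
The saturation can be made explicit with a short calculation that considers only the length factor of squash. Writing the output length as a function of the input length $n = \|s_j\|$,
\begin{equation*}
   f(n) = \frac{n^2}{1 + n^2}, \qquad f'(n) = \frac{2n}{(1 + n^2)^2} .
\end{equation*}
The derivative peaks at $n = 1/\sqrt{3}$ with $f'(n) \approx 0.65$ and decays on both sides, for example $f'(0.1) \approx 0.20$ and $f'(3) = 0.06$. Since $f'(n) < 1$ everywhere, stacking many squash activations multiplies factors smaller than one, which is one way to see why gradients can shrink in deeper architectures.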

The squash function, $v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}$, serves two primary purposes: 
constraining the capsule length to the interval $[0,1)$ and preserving the capsule's direction.
Replacing squash with ReLU directly would compromise these essential properties, 
leading to a large decline in network performance.
To resolve this, we integrate ReLU with BatchNorm to compress the capsule length while maintaining its direction, 
as outlined in Equation (\ref{capsule_relu}):
% \vspace*{-5pt}
\begin{equation}
   v_j=\mathrm{ReLU}\left(\mathrm{BatchNorm}\left(\left\|s_j\right\|_2\right)\right)\frac{s_j}{\left\|s_j\right\|_2}
   \label{capsule_relu}
\end{equation}
% \vspace*{-5pt}
In contrast to the neuron-level ReLU, Capsule ReLU performs group-level activation on capsules. If the 
batch-normalized $L_2$-norm of a capsule is negative, all elements within that capsule are zeroed out, thereby introducing sparsity.
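
A small numerical sketch illustrates this group-level behavior. Assume a batch of three capsules whose scalar inputs to BatchNorm in Equation (\ref{capsule_relu}) are $n = (1, 2, 3)$, and assume BatchNorm with unit scale, zero shift and negligible $\epsilon$ (all of these values are illustrative). The batch mean is $\mu = 2$ and the batch variance is $\sigma^2 = 2/3$, so
\begin{equation*}
   \mathrm{BatchNorm}(n) = \frac{n - \mu}{\sigma} \approx (-1.22,\; 0,\; 1.22), \qquad
   \mathrm{ReLU}\left(\mathrm{BatchNorm}(n)\right) \approx (0,\; 0,\; 1.22) .
\end{equation*}
The first two capsules are zeroed as whole vectors, while the third is multiplied by a positive scalar and therefore keeps its original direction.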

% \vspace*{-3pt}
\section{Experiments}
\vspace*{-3pt}
\subsection{Experimental Setup}
\textbf{Implementation details and datasets}

We implemented OrthCaps using PyTorch 1.12.1 on Python 3.9. 
For training, we adopted the margin loss defined in \citep{sabour2017dynamic}. 
We excluded the reconstruction loss, as it brought minimal performance benefits in our experiments. 
We used the AdamW optimizer combined with a cosine annealing learning rate scheduler and a 5-cycle linear warm-up. 
The initial learning rate was 5e-3, the weight decay 5e-4, and the batch size 512. 
Training was performed on four RTX 3090 GPUs. 
We conducted experiments on SVHN~\citep{netzer2011reading}, smallNORB~\citep{lecun2004learning}, affNIST~\citep{sabour2017dynamic}, CIFAR10, and MNIST~\citep{lecun1998mnist} for OrthCaps-S.
OrthCaps-D was trained and tested on CIFAR10, CIFAR100~\citep{krizhevsky2009learning}, Fashion-MNIST~\citep{xiao2017fashion}, and MNIST.
We resized smallNORB from $96\times96$ to $64\times64$ and subsequently cropped it to $48\times48$, 
in line with \citep{sabour2017dynamic}. All other datasets retained their original sizes, and data augmentation followed \citep{hinton2018matrix}.
To facilitate reproducibility, we list the hyperparameters in Appendix \ref{A.2}.
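
For reference, the margin loss of \citep{sabour2017dynamic} for class $k$ can be written as below. Here $T_k \in \{0,1\}$ indicates whether class $k$ is present and $v_k$ is the corresponding output capsule; the constants $m^+$, $m^-$, and $\lambda$ are assumed to be those listed in Appendix \ref{A.2}.
\begin{equation*}
   L_k = T_k \max\left(0,\, m^+ - \|v_k\|\right)^2 + \lambda \left(1 - T_k\right) \max\left(0,\, \|v_k\| - m^-\right)^2
\end{equation*}
The total loss is the sum of $L_k$ over all classes.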

\textbf{Comparison baselines}

We benchmarked OrthCaps against various baseline models. For OrthCaps-S, 
we compared it with Efficient-Caps~\citep{mazzia2021efficient}, CapsNet~\citep{sabour2017dynamic}, Matrix-CapsNet with EM routing~\citep{hinton2018matrix}, 
AR CapsNet~\citep{choi2019attention}, AA-CapsNet~\citep{pucci2021self}, DA-CapsNet~\citep{huang2020capsnet}, and a standard 7-layer CNN. 
For OrthCaps-D, the baselines were 
CapsNet (7 ensembles), AR CapsNet (7 ensembles), RS-CapsNet~\citep{9086631}, 
Inverted Dot-Product~\citep{tsai2020capsules}, DeepCaps~\citep{rajasegaran2019deepcaps}, ResNet-18~\citep{he2016deep}, and VGG-16~\citep{simonyan2014very}. 
Unless noted otherwise, all baseline results were obtained by running the official code with the hyperparameters listed in Appendix \ref{A.2}.

\vspace*{-5pt}
\subsection{Classification performance comparison}
Table (\ref{tabel2}) reports the classification performance of OrthCaps-S and OrthCaps-D,
with model size denoted by Param and computational cost by FLOPS[M]. 
All models use a backbone of 4 convolutional layers and are trained for 300 epochs.
Param and FLOPS[M] are measured on MNIST for the shallow networks (top) and on CIFAR10 for the deep networks (bottom).  
An asterisk (*) signifies that no official code is available, so we report the performance stated in the original paper. 

\vspace*{-15pt}

\begin{table}[H]
   \centering
   \begin{table}[H] 
      \centering
      \resizebox{0.9\linewidth}{!}{
      \begin{tabular}{cccccccc}
         \toprule
         Shallow Networks & Param$\downarrow$ & FLOPS[M]$\downarrow$ & MNIST  & SVHN & smallNORB & CIFAR10 \\ 
         \midrule
         OrthCaps-S & \textbf{105.5K} & 673.1 & \textbf{99.68} & \textbf{96.26} & \textbf{98.30} & \textbf{87.92} \\
         Efficient-Caps & 162.4K & \textbf{631.1} & 99.58  & 93.12 & 97.46 & 81.51 \\
         CapsNet & 8388K & 803.8 & 99.52 & 91.36 & 95.42 & 68.72 \\
         Matrix-CapsNet with EM routing & 450K & 949.6 & 99.56 & 87.42 & 95.56 & 81.39 \\
         AR CapsNet & 9.1M & 2562.7 & 99.46 & 85.98 & 96.47 & 85.39 \\
         DA-CapsNet & 7M* & - & 99.53* & 94.82* & 98.26* & 85.47* \\
         AA-Caps & 6.6M* & - & 99.34* & 91.23* & 89.72* & 79.41* \\
         % HitNet & - & - & 99.56 & 94.30 & - & 77.33 \\
         \midrule
         Baseline CNN & 4.6M & 1326.9 & 99.22 & 91.28 & 87.11 & 72.20 \\
         \bottomrule
         \end{tabular}
      }
      % \subcaption{}
   \end{table}
   
   \vspace*{-20pt}
   \begin{table}[H] 
      \centering
      \resizebox{0.9\linewidth}{!}{
      \begin{tabular}{ccccccc}
         \toprule
         Deep Networks & Param $\downarrow$ & FLOPS[M]$\downarrow$ & CIFAR10 & CIFAR100 & MNIST & FashionMNIST \\ 
         \midrule
         OrthCaps-D (simplified routing) & \textbf{164K} & 3156 & 89.09 & 67.43 & \textbf{99.72} & 93.19 \\
         OrthCaps-D (complete routing) & 574K & 3345 & 90.56 & \textbf{70.56}  & 99.59 & \textbf{94.60} \\
         AR CapsNet(7 ensembled) & 6.3M & 16657.5 & 88.94 & 56.53 & 99.49 & 91.73 \\
         Capsnet(7 ensembled) & 5.8M* & 5137.4* & 89.4* & - & - & - \\
         Inverted Dot-Product & 1.4M & 5340.9 & 84.98 & 57.32 & 99.35 & 92.85 \\
         RS-CapsNet & 5.0M* & - & 89.81* & 64.14* & - & 93.51* \\
         DeepCaps & 13.5M & \textbf{2687} & \textbf{91.01} & 69.72 & 99.46 & 92.52 \\
         \midrule
         ResNet-18\footnotemark[1] & 11.7M & 5578.8 & 95.10 & 77.60 & 99.29 & 93.32  \\
         VGG-16\footnotemark[1] & 147.3M & 15143.1 & 93.57 & 73.10 & 99.21 & 92.21 \\
         \bottomrule
         \end{tabular}
      }    
      % \subcaption*{(b)}
   \end{table}

   \vspace*{-12pt}

   \caption{\textbf{Top: } OrthCaps-S ranks as the top or second best across four datasets, 
   standing out as resource-efficient with only 105.5K parameters and 673.1M FLOPS.
   \textbf{Bottom: } OrthCaps-D shows competitive performance with fewer parameters and
   less computational cost. }
   \label{tabel2}
\end{table}
\setcounter{footnote}{1}
\footnotetext{https://github.com/kuangliu/pytorch-cifar}

\vspace*{-12pt}
As shown in Table (\ref{tabel2}), 
OrthCaps-S achieves superior parameter efficiency with merely 105.5K parameters, fewer than the CNN baseline, CapsNet, and all listed variants. 
For instance, Efficient-Caps, a state-of-the-art model in terms of efficiency, has about 50\% more parameters.
Despite its compact design, OrthCaps-S matches or outperforms the other capsule network designs across all four datasets. 
On SVHN and CIFAR10, OrthCaps-S achieves accuracies of 96.26\% and 87.92\%, respectively, surpassing CapsNet, which has about 80 times as many parameters.
OrthCaps-S requires 673.1M FLOPS; 
the slight increase over Efficient-Caps (631.1M) 
is due to the additional computations from the pruned capsule layer and the orthogonal transformations. 
Given the substantial decrease in parameter count and the improved accuracy, this FLOPS trade-off is warranted.

As Table (\ref{tabel2}) also shows, OrthCaps-D 
exhibits competitive performance with fewer parameters and a lower computational cost than most baselines, including on the more complex CIFAR10 and CIFAR100 datasets. 
Although convolution-based networks such as ResNet-18 and VGG-16 achieve higher accuracy on CIFAR10 and CIFAR100, 
OrthCaps-D with simplified routing uses just 1.41\% and 0.11\% of their parameters and 56\% and 20.8\% of their FLOPS, respectively.
The efficiency of OrthCaps is also evident in the comparison with DeepCaps:
while DeepCaps achieves 91.01\% accuracy on CIFAR10, 
it does so with a much larger parameter count of 13.42M. 
Both OrthCaps-D variants thus maintain high performance with far fewer parameters.
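
The ratios quoted above can be checked against Table (\ref{tabel2}) with the following arithmetic, which assumes the rounded table entries for OrthCaps-D with simplified routing (164K parameters, 3156M FLOPS); small deviations from the figures in the text are due to this rounding.
\begin{align*}
   \text{ResNet-18:}\quad & \frac{164\text{K}}{11.7\text{M}} \approx 1.4\%, & \frac{3156}{5578.8} &\approx 56.6\%, \\
   \text{VGG-16:}\quad & \frac{164\text{K}}{147.3\text{M}} \approx 0.11\%, & \frac{3156}{15143.1} &\approx 20.8\%.
\end{align*}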


\subsection{Ablation study}

\begin{figure}[h]
   \vspace*{-25pt}
   \centering
   \begin{minipage}{0.45\textwidth}
      \centering
      \includegraphics[width=1\textwidth]{./figs/CRELU.png}
      {\small\subcaption{}}
   \end{minipage}
   % \hspace{0.01\textwidth}
   \begin{minipage}{0.54\textwidth}
      \vspace*{-15pt}
      \begin{table}[H] 
      \centering
      \resizebox{0.9\linewidth}{!}{
      \begin{tabular}{ccc}
         \toprule
         Variants & FPS$\uparrow$ & ACC$\uparrow$ \\ 
         \midrule
         Attention routing \& $\alpha$-entmax \& orthogonality & 1639 & 99.69 \\
         Attention routing \& softmax & 1785 & 99.62  \\
         Dynamic routing \& $\alpha$-entmax \& orthogonality & 1232 & 99.51 \\
         Dynamic routing \& softmax & 1339 & 99.49\\
         \bottomrule
         \end{tabular}
      }
      {\footnotesize\subcaption*{(b)}}
      \end{table}
      \vspace*{-20pt}
      \begin{table}[H]
      \centering
      \resizebox{0.75\linewidth}{!}{
      \begin{tabular}{ccc}
            \toprule
            Variant & Param[K]$\downarrow$ & ACC$\uparrow$ \\ 
            \midrule
            OrthCaps-S with pruning & 105 &  99.69 \\
            OrthCaps-S & 127 & 99.63 \\
            Capsnet with pruning & 7492 & 99.45\\
            Capsnet & 8388 & 99.42 \\
            \bottomrule
            \end{tabular}
      }
      {\small\subcaption*{(c)}}
      \end{table}         
   \end{minipage}
   \vspace*{-7pt}
   \caption{\textbf{Ablation study results.}
            \textbf{(a):} Test accuracy curves of different activation functions for the OrthCaps-D and OrthCaps-S models on CIFAR-10. We train for 200 epochs with an initial learning rate of \(0.001\) 
            and decay the learning rate to \(80\%\) of its original value at epochs 60, 120, and 160.
            \textbf{(b):} Comparison of orthogonal sparse attention routing and dynamic routing on MNIST. We report the performance of OrthCaps-S trained for 300 epochs.
            \textbf{(c):} CapsNets are compared with and without the pruning layer on the MNIST dataset, with the similarity threshold set to 0.7.}
   \label{ablation}
   \vspace*{-15pt}
\end{figure}




\textbf{Orthogonal Self-Attention Routing}

Table (b) of Figure (\ref{ablation}) cross-compares accuracy (ACC) and frames per second (FPS): 
we contrast attention routing with dynamic routing~\citep{sabour2017dynamic} 
and sparse $\alpha$-entmax with standard softmax. 
Attention routing consistently outperforms dynamic routing in both classification accuracy and processing speed, 
achieving a 25.8\% speed enhancement on average. 
Even with the faster softmax, 
dynamic routing only reaches 1339 FPS, indicating its inherent computational inefficiency.  
The higher complexity of $\alpha$-entmax compared with softmax 
and the added computations from orthogonality reduce the speed of attention routing from 1785 to 1639 FPS,
but in return raise accuracy from 99.62\% to 99.69\%.
Overall, our attention routing combined with $\alpha$-entmax and orthogonality 
balances performance and computational efficiency.

% \begin{table}[H] 
%    \centering
%    \resizebox{0.8\linewidth}{!}{
%    \begin{tabular}{ccccc}
%        \toprule
%        Variants & fps$\uparrow$ & MNIST & CIFAR-10 \\ 
%        \midrule
%        Attention routing \& $\alpha$-entmax \& orthogonality  & 1639 & 99.69 & 87.92 \\
%        Attention routing \& softmax & 1685 & 99.62 & 85.23 \\
%        dynamic routing \& $\alpha$-entmax \& orthogonality & 1232 & 99.51 & 82.30 \\
%        dynamic routing \& softmax & 1339 & 99.49 & 81.72 \\
%        \bottomrule
%        \end{tabular}
%    }
%        \label{tabel5}
%        \caption{Comparison of Orthogonal sparse attention routing and dynamic routing algorithms. we report the performance of OrthCaps-S trained 300 epochs.}
%    \end{table}



% \vspace*{10pt}
\textbf{Pruned Capsule Layer}

Figure (\ref{similarities}) illustrates that integrating the pruned capsule layer 
significantly decreases the average capsule similarity by eliminating redundant capsules.  
As the capsule count shrinks, the dimensions of the associated prediction matrix shrink as well, 
thereby lowering the parameter count. 
This is confirmed by Table (c) of Figure (\ref{ablation}),
where pruning reduces the parameter count of OrthCaps-S from 127K to 105K.  
Despite the reduction, performance is not compromised; 
in fact, the pruned model achieves an accuracy of 99.69\%, compared to 99.63\% for the unpruned version. 
Applying the same pruning to CapsNet reduces its parameters from 8388K to 7492K, while accuracy rises slightly from 99.42\% to 99.45\%.
Hence, 
our pruning approach not only streamlines the model by eliminating redundant capsules but also slightly improves its accuracy.
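
As a reminder of the quantity being thresholded, the cosine similarity between two flattened capsules $u_i$ and $u_j$ is given below. The masking rule next to it is a simplified sketch that assumes capsules are compared in order of decreasing $L_2$-norm, based on the symbols $S$, $M$, and $\theta$ listed in the appendix; it is not a restatement of the exact algorithm.
\begin{equation*}
   S_{ij} = \frac{u_i^{T} u_j}{\|u_i\| \, \|u_j\|}, \qquad
   M_j = \begin{cases} 0, & \text{if } S_{ij} > \theta \text{ for some retained capsule } i \text{ with } \|u_i\| \geq \|u_j\|, \\ 1, & \text{otherwise.} \end{cases}
\end{equation*}
With $\theta = 0.7$, this corresponds to an angle of less than about $45.6^\circ$ between the two capsules.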

\begin{figure}[h]
   \centering
      \begin{minipage}{0.45\textwidth} 
         \vspace*{-5pt}
         Figure (\ref{pruned}) illustrates the necessity of combining pruning with orthogonality. 
         Redundancy is measured by the cosine similarity between capsules, as described above. 
         As the network goes deeper, the dashed line (pruning without orthogonality) shifts rightward, 
         indicating an increase in capsule similarity and redundancy.
         This shift suggests that the change in capsule direction caused by the weight matrix reintroduces redundancy. 
         In contrast, the solid line (pruning with orthogonality) shows consistently low capsule similarity.
         Even at 28 layers deep, the similarity remains below the 0.7 pruning threshold, 
         confirming that orthogonality preserves capsule directions and keeps inter-capsule correlations low.
         The black dash-dot line denotes similarity without orthogonality and pruning; 
         it exhibits the highest redundancy, further emphasizing the significance of our pruning approach.
      \end{minipage}%
      \hspace*{6pt}
      \begin{minipage}{0.54\textwidth}
         \vspace*{-10pt}
         \centering
         \includegraphics[scale=0.4]{./figs/pruned.png}
         \vspace*{-7pt}
         \caption{\textbf{Redundancy comparison between different pruning strategies. (The more to the left, the better.)}
         The x-axis shows capsule similarity; 
         the y-axis indicates capsule count percentage. 
         PCL, C3, and C28 mark the primary, third, and twenty-eighth capsule layers. 
         Solid (C3-Orth, C28-Orth) and dashed lines contrast pruning with and without orthogonality; 
         the dash-dot line shows no pruning or orthogonality.} 
         %Tests are conducted on OrthCaps-S with the MNIST dataset.}
         \label{pruned}
      \end{minipage}
\end{figure}

\clearpage
\textbf{Capsule ReLU}

Figure (\ref{ablation})(a) presents the performance of various activation functions on CIFAR10.  
In shallow networks, squash converges faster, as is evident from its steep accuracy curve.
In deeper networks, however, squash without batch normalization fails to learn useful information during training, 
and its accuracy stays near 10\%, i.e., chance level for ten classes. 
Adding batch normalization to this deep squash model improves its accuracy, underscoring the importance of batch normalization.
In contrast, models employing Capsule ReLU as their activation function converge faster and reach higher accuracy.


\subsection{Robustness to adversarial attacks}

Capsule networks have demonstrated strong robustness~\citep{hinton2018matrix}.
Because OrthCaps eliminates redundant capsules and thereby suppresses low-$L_2$-norm capsules, 
which we regard as noise capsules \citep{de2020introducing},
we expect it to be more robust against small perturbations.
To evaluate this, 
we compare the robustness of OrthCaps, CapsNet, orthogonal CNNs (OCNN), and a 7-layer CNN on the CIFAR10 dataset. 
We employ the Projected Gradient Descent (PGD) white-box attack method~\citep{guo2019simple}, setting the maximum iteration count to 
40, the step size to 0.01, and the maximum perturbation to 0.1.
We assess robustness using three metrics: attack time (AT), model query count (QC), and accuracy after the attack (Acc). 
As shown in Table (\ref{robustness}), OrthCaps performs best on all three metrics, 
which we attribute to its superior handling of complex spatial structures. 
Specifically, OrthCaps requires 1.72 times the number of queries compared to CapsNet and retains an 
accuracy after the attack that is about 9 percentage points higher, showing its robustness in image classification.
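
For concreteness, one common form of the PGD update is sketched below. It assumes an $L_\infty$-bounded perturbation and a classification loss $\mathcal{L}$ on the model $f$; the norm, the loss, the initialization, and the symbols $\alpha$, $\epsilon$, and $\Pi$ are illustrative assumptions, since only the step size, the perturbation bound, and the iteration count are specified above.
\begin{equation*}
   x^{(t+1)} = \Pi_{\{x' \,:\, \|x' - x\|_\infty \leq \epsilon\}} \left( x^{(t)} + \alpha \, \mathrm{sign}\left( \nabla_{x^{(t)}} \mathcal{L}\left(f\left(x^{(t)}\right), y\right) \right) \right), \qquad t = 0, \dots, 39,
\end{equation*}
with $x^{(0)} = x$, step size $\alpha = 0.01$, and maximum perturbation $\epsilon = 0.1$, where $\Pi$ denotes projection onto the feasible set.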

\begin{minipage}{0.57\textwidth}
   \vspace*{-5pt}
   \begin{table}[H] 
      \centering
      \resizebox{0.77\linewidth}{!}{
      \begin{tabular}{cccc}
          \toprule
          Variants & AT(s) $\uparrow$ & QC[K] $\uparrow$ & Acc $\uparrow$\\ 
          \midrule
          OrthCaps & \textbf{345.92} & \textbf{69K} & \textbf{23.52\%} \\
          CapsNet & 198.93 & 48K & 14.62\% \\
          OCNN & 136.7 & 46K & -\\
          baseline CNN & 16.65 & 10K & 0.35\%\\
          \bottomrule
          \end{tabular}
      }
      \vspace*{-5pt}
      \caption{Comparison of 7 ensembled OrthCaps, CapsNet, OCNN and baseline CNN under PGD attack. The CIFAR10 dataset is used without any data augmentation.}% Our metrics are an average of 5 test runs. }
      \label{robustness}
   \end{table}
\end{minipage}
\hspace*{-10pt}
\begin{minipage}{0.44\textwidth}
   \vspace*{-10pt}
   \begin{table}[H]
       \centering
       \caption{Orthogonality of the weight matrices in attention routing on the SVHN dataset. $O$ decreases from about 0.02 to about 0.01 during training.}
       \vspace*{-5pt}
       \resizebox{0.58\linewidth}{!}{
       \begin{tabular}{ccc}
             \toprule
             EPOCH  & ACC & $O$ $\downarrow$\\ 
             \midrule
             1 &  83.75 & 0.0236 \\
             10 &  98.58 & 0.0215\\
             100 &  99.42 & 0.0153  \\
             300 & 99.56 & 0.0120 \\
             \bottomrule
             \end{tabular}
       }
       \label{tabel3}
   \end{table}
\end{minipage}

% \begin{table}[H] 
%    \centering
%    \resizebox{0.5\linewidth}{!}{
%    \begin{tabular}{cccc}
%        \toprule
%        Variants & AT(s) $\uparrow$ & NAQ[K] $\uparrow$ & ACC $\uparrow$\\ 
%        \midrule
%        OrthCaps & 345.92 & 96K & 31.18\% \\
%        CapsNet & 198.93 & 54K & 20.81\% \\
%        OCNN & 136.7 & 46K & -\\
%        baseline CNN & 16.65 & 10K & 3.53\%\\
%        \bottomrule
%        \end{tabular}
%    }
%    \caption{Comparison of OrthCaps, CapsNet, OCNN and baseline CNN under PGD attack. The Cifar10 dataset is used without any data augmentation. Our metrics are an average of 5 test runs. }
%    \label{robustness}
% \end{table}

\vspace*{-2pt}
\subsection{Orthogonality}


This experiment demonstrates the effectiveness of the Householder orthogonalization method 
and its advantages over other orthogonalization methods. 
We define an orthogonality metric \( O = \| K^T K - I \| \), where a smaller \( O \) indicates better orthogonality. 
In Table (\ref{tabel3}), \( O \) decreases from 0.0236 to 0.0120 during training, 
substantiating the effectiveness of the orthogonalization method. 
In Figure (\ref{orth}), our method achieves better orthogonality and loss decay than OCNN \citep{wang2020orthogonal}.

We further demonstrate Householder's role as a regularization technique for neural networks in Appendix \ref{A.3.2}.




\vspace*{-5pt}
\section{Conclusions and Future Work}

In this study, we have introduced a novel capsule network with orthogonal sparse attention routing. 
Specifically, Householder orthogonal decomposition is used to enforce strict matrix orthogonality in attention routing without additional penalty terms, 
and the capsule pruning layer introduces sparsity into routing, minimizing capsule redundancy. 
Our new activation function, Capsule ReLU, mitigates the vanishing gradient problem.
Experiments show that OrthCaps uses fewer parameters and less computation than comparable capsule networks, 
addressing the computational expense and redundancy of dynamic routing. 
On image classification tasks, OrthCaps outperforms state-of-the-art capsule networks 
and demonstrates improved robustness. 
This work paves the way for future research in capsule networks, 
and we look forward to further developments in this area.

% \subsubsection*{Acknowledgments}
% Use unnumbered third-level headings for the acknowledgments. All
% acknowledgments, including those to funding agencies, go at the end of the paper.

\newpage
\bibliography{iclr2024_conference}
\bibliographystyle{iclr2024_conference}

\newpage
\appendix
\section{Appendix}

\subsection{Symbols and abbreviations used in this paper}

\begin{table}[H] 
   \centering
   \resizebox{0.8\linewidth}{!}{
   \begin{tabular}{ll}
       \toprule
       Symbol & Description  \\ 
       \midrule
       OrthCaps & Orthogonal Capsule Network  \\
       OrthCaps-S & Shallow network variant of OrthCaps \\
       OrthCaps-D & Deep network variant of OrthCaps \\
       $x$ & Input image \\
       $l$ & Layer index \\
       $\Phi^l$ & Features from convolutional layers at level $l$\\
       $u_l$ & Capsules at layer $l$ \\
       $v_{l+1}$ & Capsules at layer $l+1$ \\
       $n$ & Capsule count in a given layer \\
       $d$ & Capsule dimension \\
       $W$ & Feature map width \\
       $H$ & Feature map height \\
       $B$ & Batch size \\
       $u_{flat}$ & Flattened capsules \\
       $u_{sorted}$ & Capsules sorted by their $L_2$-norm \\
       $u_{pruned}$ & Pruned capsules \\
       $M$ & Mask matrix for pruned capsule layer \\
       $S$ & Cosine similarity matrix for pruned capsule layer \\
       $\theta$ & Threshold for pruned capsule layer \\
       $W_{ij}$ & Weight matrix for simplified attention routing \\
       $Q, K, V$ & Attention routing components: Query, Key, Value \\
       $W_Q, W_K, W_V$ & Weight matrices for Q, K, V \\
       $C_{i,j}$ & Coefficient matrix for attention routing \\
       $s_{i,j}$ & Votes for attention routing \\
       $g$ & Activation function \\
       $H$ & Householder matrix \\
       $a_{i}$ & Unit vector in Householder matrix formulation \\
       $b_{i}$ & Learnable vector in Householder matrix \\
       \bottomrule
       \end{tabular}
   }
\end{table}

\subsection{Hyperparameters}\label{A.2}

\begin{table}[H] 
   \centering
   \resizebox{0.8\linewidth}{!}{
   \begin{tabular}{ll}
       \toprule
       Hyperparameter & Value  \\ 
       \midrule
       Batch size & 512 (4 GPUs in parallel)  \\
       Learning rate & 5e-3 \\
       Weight decay & 5e-4 \\
       Optimizer & AdamW \\
       Scheduler & CosineAnnealingLR with 5-cycle linear warm-up \\
       Epochs & 300 \\
       Data augmentation & RandomHorizontalFlip, RandomCrop with padding of 4 \\
       Dropout & 0.25 \\
       $m^+$ & 0.9 \\
       $m^-$ & 0.1 \\
       $\lambda$ & 0.5 \\
       $\theta$ & 0.7 \\
       $d$ & 16 \\
       \bottomrule
       \end{tabular}
   }
\end{table}

\subsection{Householder Orthogonalization}
\subsubsection{Proof of Lemma 2}\label{A.3.1} 

\textbf{Proof:}

Let $W$ denote any one of $W_Q$, $W_K$, $W_V$. By construction, $W$ can be expressed as
\begin{equation} \label{householderMulti}
	W = H_0H_1 \dots H_{d-1}
\end{equation}
where $H_i = I- 2a_ia_i^T$ and $a_i$ is a unit vector, i.e. $a_i^Ta_i = 1$. We have
\begin{equation} \label{2orthogonal}
	W^TW = H_{d-1}^T \dots H_1^T H_0^T H_0H_1 \dots H_{d-1}
\end{equation}

We first show that each $H_i$ is orthogonal, i.e. $H_i^T H_i = I$. Since $H_i$ is symmetric and $a_i^Ta_i = 1$,
\begin{equation} \label{2householder}
	\begin{aligned}
		H_i^T H_i &= (I - 2a_ia_i^T)^T (I - 2a_ia_i^T) \\
		          &= I - 4a_ia_i^T + 4 a_i (a_i^T a_i) a_i^T \\
		          &= I - 4a_ia_i^T + 4 a_i a_i^T = I
	\end{aligned}
\end{equation}

Therefore, cancelling the innermost pairs $H_0^TH_0$, $H_1^TH_1$, \dots, $H_{d-1}^TH_{d-1}$ in Equation (\ref{2orthogonal}) one after another yields $W^TW = I$.
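
As a small sanity check of the argument above, consider the following worked example for $d = 2$; the two unit vectors are chosen purely for illustration.
\begin{equation*}
	a_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
	a_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}
	\quad \Longrightarrow \quad
	H_0 = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
	H_1 = \begin{pmatrix} 0 & -1 \\ -1 & 0 \end{pmatrix}
\end{equation*}
\begin{equation*}
	W = H_0 H_1 = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad
	W^T W = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = I
\end{equation*}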

% Based on our prior knowledge of Householder transformations, 
% matrices derived through this approach inherently possess orthogonality. 
% We further explore the orthogonality of 
% \( W_Q \), \( W_K \), and \( W_V \) by utilizing the principles of Householder Orthogonalization.

% \textbf{Lemma 1: Every real orthogonal $n \times n$ matrix
% $U$ is the product of $n-m$ real orthogonal Householder matrices}

% Assume that \( U \neq I_n \), 
% we identify a vector \( x \neq 0 \) with \( Ux \neq x \). 
% Without loss of generality, let's assume \( x \) is normalized, resulting in \( x^T x = 1 \).

% Define:
% \begin{equation}
% v := \frac{Ux - x}{\|Ux - x\|_2}
% \end{equation}
% where the denominator represents the Euclidean 2-norm. Note that \( v^T v = 1 \) and
% $\|Ux - x\|_2^2 = 2 - 2x^T Ux$

% For the Householder transform defined as \( H_v := I_n - 2vv^T \), 
% we aim to show that \( HvU \) has an eigenspace \( E_{HvU}(1) \) for the eigenvalue 1
% that contains the eigenspace \( E_U(1) \) for the eigenvalue 1 of \( U \) and is at least one dimension larger.
% i.e., $\dim(EHvU(1)) \geq \dim(EU(1)) + 1$ and $EU(1) \subset EHvU(1)$.

% By computation, we have:
% \begin{equation}
% H_vUx = Ux - \frac{2(2 - 2x^T Ux)(Ux - x)(Ux - x)^T Ux}{1 - x^T Ux} = x
% \end{equation}

% This shows that \( E_U(1) \subseteq E_{H_vU}(1) \). 
% By continuing this procedure inductively with \( HvU \) in place of \( U \) 
% until $ n = dim(ker(H_z \cdots H_y H_v U - I_n)) $, 
% we conclude that any real orthogonal \( n \times n \) matrix \( U \) 
% can be expressed as the product of at most $ n - dim(ker(U - I_n)) $ 
% real orthogonal Householder matrices.

% \textbf{Lemma 2: The product of \( n \) Householder matrices remains orthogonal}

% Given a unit vector \( a \) (i.e., \( a^T a = 1 \)), the Householder matrix \( H \) is defined as:
% \begin{equation}
%   H = I - 2aa^T
% \end{equation}
% where \( I \) denotes the identity matrix.

% Firstly, we prove that a single Householder matrix \( H \) is orthogonal, that is, \( H^T H = I \).

% Considering:
% \begin{equation}
%  H^T H = (I - 2aa^T)^T (I - 2aa^T)
% \end{equation}
% this can be expanded as:
% \begin{equation}
%  H^T H = I - 2a(a^T) - 2(a^T)a + 4aa^Taa^T
% \end{equation}
% Given \( a^T a = 1 \):
% \begin{equation}
%  H^T H = I - 2a(a^T) - 2(a^T)a + 4a(a^T)a(a^T) = I
% \end{equation}
% Thus, the Householder matrix is orthogonal.

% Assume the product of \( n = k \) Householder matrices is orthogonal. We aim to prove for \( n = k + 1 \) Householder matrices.

% Consider the product \( H_{k+1}H_k \cdots H_1 \). By the inductive hypothesis, \( H_k \cdots H_1 \) is orthogonal. Since the product of two orthogonal matrices remains orthogonal, the product of \( H_{k+1} \) (a Householder matrix) with \( H_k \cdots H_1 \) is also orthogonal.

% Thus, we have shown that the product of \( n = k + 1 \) Householder matrices is orthogonal.
% We conclude that the product of any \( n \) Householder matrices is an \( n \times n \) orthogonal matrix.
% This is a powerful statement as it allows us to generate larger orthogonal matrices by combining smaller ones.

% \textbf{Main Proof: \( W_Q \), \( W_K \), and \( W_V \) are Orthogonal}

% Based on Lemma 1, every real orthogonal $n \times n$ matrix $U$ can be expressed as a product of real orthogonal Householder matrices. 
% Using the Householder construction ensures that these matrices inherently possess orthogonality. 
% Lemma 2 further emphasizes that the product of Householder matrices maintains this orthogonality.
% We aim to construct \( W_Q \), \( W_K \), and \( W_V \) into orthogonal matrices.

% Given the capsules \( u_{l,i} \) and \( v_{l+1,j} \) at layer \( l \) and \( l+1 \) respectively, 
% each with dimension \( d \), 
% we have three weight matrices \( W_Q \), \( W_K \), and \( W_V \) 
% to derive keys, queries, and values from capsule \( u_{l,i} \):

% \begin{align*}
% Q &= W_Q \times u_{l,i} \\
% K &= W_K \times u_{l,i} \\
% V &= W_V \times u_{l,i}
% \end{align*}

% Following Householder's construction, \( W_Q \) is defined as:
% \begin{equation}
% W_Q = H_0 H_1 \dots H_{d-1} = \prod_{i=0}^{d-1} \left( I - 2 a_i a_i^T \right) = \prod_{i=0}^{d-1} \left( I - \frac{2 b_i b_i^T}{{\|b_i\|}^2} \right)
% \end{equation}
% where \( a \) is a unit column vector and \( b \) is a random column vector. $W_K $ and $W_V$ are constructed in the same way. 

% Following Lemma 2, as \( b \) is updated by gradient backpropagation, 
% \( W_Q \), \( W_K \), and \( W_V \) remains orthogonal matrices.
% Their orthogonality is ensured because they result from products of Householder matrices.
% $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ transform capsule $\mathbf{u}_{l,i}$ into $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, respectively. 
% They project capsule \( u_{l, i} \) into an orthogonal subspace. 
% This ensures that \( Q \), \( K \), and \( V \) also reside in orthogonal subspaces.
% Due to the inherent orthogonality of these matrices, 
% the transformation is isometric, conserving distances and angles in the transformed domain.

\subsubsection{Householder as a Regularization Technique}\label{A.3.2} 

We demonstrate Householder's role as a regularization technique for neural networks. 
For ResNet18, we flatten and concatenate convolutional kernels into a matrix $W$ 
and orthogonalize it to minimize off-diagonal elements, which reduces channel-wise filter similarity and redundancy. 
To quantify these properties, we use Guided Backpropagation to dynamically visualize the activations \citep{wang2020orthogonal}. 
Compared to directly computing the covariance matrix of convolutional kernels, 
the gradient-based covariance matrix offers a more comprehensive view of the dynamic behavior of filters. 
We denote the gradients from Guided Backpropagation by $G$ and compute their correlation matrix $\mathrm{corr}(G)$ as
\begin{equation}
   \mathrm{corr}(G) = \left(\text{diag}(K_{GG})\right)^{-\frac{1}{2}} K_{GG} \left(\text{diag}(K_{GG})\right)^{-\frac{1}{2}}
\end{equation}

where \( K_{GG} = \frac{1}{M} \left( (G - \mathbb{E}[G]) (G - \mathbb{E}[G])^T \right) \) 
and $M$ is the number of channels.
The histogram of the off-diagonal elements of 
$\mathrm{corr}(G)$ in Figure (\ref{orth}) shows a left-shifted distribution for the Householder-regularized model, 
confirming its effectiveness in enhancing filter diversity and reducing redundancy.
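
To clarify how the normalization acts entrywise, the definition above can be expanded as follows; the index notation $(K_{GG})_{ij}$ is introduced here only for illustration, and the diagonal entries of $K_{GG}$ are assumed to be positive.
\begin{equation*}
   \mathrm{corr}(G)_{ij} = \frac{(K_{GG})_{ij}}{\sqrt{(K_{GG})_{ii}\,(K_{GG})_{jj}}}, \qquad
   \mathrm{corr}(G)_{ii} = 1, \qquad
   \left|\mathrm{corr}(G)_{ij}\right| \leq 1 ,
\end{equation*}
where the bound follows from the Cauchy--Schwarz inequality. Off-diagonal entries closer to zero therefore indicate less correlated, i.e. less redundant, filters.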

\vspace*{-5pt}
\begin{figure}[h]
   % \hspace{0.01\textwidth}
   \begin{minipage}{0.42\textwidth}
       \centering
       \includegraphics[width=\textwidth]{./figs/distribution1.png}
       \subcaption{}
   \end{minipage}
   % \hspace{0.01\textwidth}
   \begin{minipage}{0.58\textwidth}
       \centering
       \includegraphics[width=\textwidth]{./figs/ortho.png}
       \subcaption{}
   \end{minipage}
   \vspace*{-10pt}
   \caption{\textbf{(a):} The normalized histogram of pairwise filter similarities in standard ResNet34 with different regularizers. The Householder orthogonalization method reduces filter similarity the most.
            \textbf{(b):} CapsNet with different orthogonal regularization methods on the MNIST dataset. Our Householder orthogonalization method reaches better orthogonality and loss decay.}
   \label{orth}         
\end{figure}

\subsection{Capsule Network}

\subsubsection{Dynamic Routing}

Algorithm~\ref{algo:dynamic_routing} describes the dynamic routing algorithm. 
It allocates the output vectors of lower-level capsules to higher-level capsules based on their similarity, 
thereby achieving an adaptive feature combination. 
However, as is evident from $s_j = \sum_i c_{ij} \hat{u}_{j|i}$,
each higher-level capsule is a weighted sum over all lower-level capsules,
so the two layers are fully connected.
Furthermore, the routing algorithm essentially acts as an unsupervised clustering process for capsules 
and requires $r$ iterations for the coupling coefficients $c$ to converge. 
The choice of $r$ is a trade-off: too few iterations may prevent $c$ from converging and impair routing efficacy, 
while too many increase the computational cost.
\begin{algorithm}
   \caption{Dynamic Routing}
   \label{algo:dynamic_routing}
   \begin{algorithmic}[1]
      \Procedure{ROUTING}{$\hat{u}_{j|i}$, $r$, $l$}
         \For{all capsule $i$ in layer $l$ and capsule $j$ in layer $(l + 1)$} $b_{ij} \leftarrow 0$
         \EndFor
         \For{$r$ iterations}
            \For{all capsule $i$ in layer $l$} $c_i \leftarrow \text{softmax}(b_i)$ 
            \EndFor
            \For{all capsule $j$ in layer $(l + 1)$} $s_j \leftarrow \sum_i c_{ij} \hat{u}_{j|i}$
            \EndFor
            \For{all capsule $j$ in layer $(l + 1)$} $v_j \leftarrow \text{squash}(s_j)$ 
            \EndFor
            \For{all capsule $i$ in layer $l$ and capsule $j$ in layer $(l + 1)$} $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$
            \EndFor
         \EndFor
         \State \Return $v_j$
      \EndProcedure
   \end{algorithmic}
\end{algorithm}
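
To make the cost of the loop concrete, the softmax step can be written out explicitly. 
As a sketch, let $I$ and $J$ denote the numbers of capsules in layers $l$ and $(l+1)$, respectively (both symbols are introduced here only for illustration):
\[
   c_{ij} = \frac{\exp(b_{ij})}{\sum_{k=1}^{J} \exp(b_{ik})}.
\]
Since every $b_{ij}$ is initialized to zero, the first iteration yields $c_{ij} = 1/J$, so $s_j = \frac{1}{J} \sum_i \hat{u}_{j|i}$ is a uniform average of the votes. 
The coefficients become selective only through the subsequent agreement updates $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$. 
Each iteration visits all $I \times J$ capsule pairs, so the routing cost grows roughly linearly with the number of iterations.
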
In conclusion, it is crucial to introduce a straightforward, iteration-free routing algorithm.


\subsubsection{Squash Activation Function}\label{A.4.2} 

Figure~\ref{squash} displays the sigmoid and squash functions together with their derivatives. 
The x-axis represents the $L_2$-norm of the vote from routing, 
which serves as the function input, 
while the y-axis denotes the function values and their respective derivatives.

\begin{figure}[h]
   \begin{center}
   \includegraphics[width=0.9\textwidth]{./figs/sigmoid.png}
   % \fbox{\rule[-.5cm]{0cm}{4cm} \rule[-.5cm]{4cm}{0cm}}
   \end{center}
   \caption{Comparison of the sigmoid and squash functions and their derivatives. \textbf{Left:} The sigmoid and squash functions. 
            \textbf{Right:} The derivatives of sigmoid and squash.}
   \label{squash}
\end{figure}


The Squash function is defined as:

\[v_j = \frac{{||s_j||^2}}{{1 + ||s_j||^2}} \frac{{s_j}}{{||s_j||}}\]

where \(v_j\) is the output vector, \(s_j\) is the input vector, and \(||s_j||\) is the $L_2$-norm of the input vector.

When the norm \(||s_j||\) of the input vector \(s_j\) approaches zero, 
the output \(v_j\) of the Squash function tends to zero,
because the scaling factor \(||s_j||^2 / (1 + ||s_j||^2)\) vanishes; 
when \(||s_j||\) approaches infinity, the same factor saturates and the norm of \(v_j\) tends to one. 
Between these two regimes, the function undergoes a rapid transition. 
For small but nonzero \(||s_j||\), 
the output \(v_j\) of the Squash function changes relatively rapidly, which is not conducive to gradient optimization and can lead to unstable training.
The derivative approaches zero when the norm \(||s_j||\) of the input vector \(s_j\) approaches zero or infinity, 
leading to vanishing-gradient issues.
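
The behavior of the derivative can be checked with a short calculation. 
As a sketch, write $x = ||s_j||$ and let $f(x)$ be the norm of the Squash output (both symbols are introduced here for illustration only):
\[
   f(x) = \frac{x^2}{1 + x^2}, \qquad
   f'(x) = \frac{2x}{(1 + x^2)^2}, \qquad
   f''(x) = \frac{2 - 6x^2}{(1 + x^2)^3}.
\]
Hence $f'(x) \to 0$ both as $x \to 0$ and as $x \to \infty$, and $f'$ attains its maximum at $x = 1/\sqrt{3} \approx 0.58$, where $f'(x) = 3\sqrt{3}/8 \approx 0.65$. 
This calculation covers only the derivative with respect to the norm, not the full Jacobian of $v_j$ with respect to $s_j$.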

Thus, it is necessary to design a new activation function to replace the Squash function in deep capsule networks.




\end{document}
