%\documentclass{uai2023} % for initial submission
\documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
                                         % Modern (has noticable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams

\usepackage{times}
\usepackage{soul}
\usepackage{url}
\usepackage[utf8]{inputenc}
%\usepackage[hidelinks]{hyperref}

\usepackage[]{caption}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bbold}

\usepackage{multirow}
\usepackage{physics}

% these packages are added by Kelvin
\usepackage{amsthm,amssymb} % define proof environment
\usepackage{soul} % to strike through text
%\usepackage[hidelinks]{hyperref}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}


%to IEEE algorithm format
%\usepackage{algorithm}
%\usepackage{algorithmic}
%\usepackage{algpseudocode}
\usepackage[linesnumbered,ruled,vlined]{algorithm2e}
\newcommand\mycommfont[1]{\footnotesize\ttfamily\textcolor{blue}{#1}}
\SetCommentSty{mycommfont}

% Support for easy cross-referencing
\usepackage[capitalize]{cleveref}
\crefname{section}{Sec.}{Secs.}
\Crefname{section}{Section}{Sections}
\Crefname{table}{Table}{Tables}
\crefname{table}{Tab.}{Tabs.}

\newtheorem{innercustomgeneric}{\customgenericname}
\providecommand{\customgenericname}{}
\newcommand{\newcustomtheorem}[2]{%
  \newenvironment{#1}[1]
  {%
   \renewcommand\customgenericname{#2}%
   \renewcommand\theinnercustomgeneric{##1}%
   \innercustomgeneric
  }
  {\endinnercustomgeneric}
}
\newcustomtheorem{customthm}{Theorem}
\newcustomtheorem{customlem}{Lemma}
\newcustomtheorem{customexa}{Example}
\newcustomtheorem{customrem}{Remark}

\usepackage{comment}
\usepackage{xcolor}
\newcommand{\q}[1]{\textcolor{red}{#1}}
\newcommand{\g}[1]{\textcolor{blue}{#1}}
\newcommand{\kl}[1]{\textcolor{orange}{#1}}

% the following package is optional:
%\usepackage{latexsym}

% See https://www.overleaf.com/learn/latex/theorems_and_proofs
% for a nice explanation of how to define new theorems, but keep
% in mind that the amsthm package is already included in this
% template and that you must *not* alter the styling.
\newtheorem{example}{Example}
\newtheorem{theorem}{Theorem}
\newtheorem{definition}{Definition}
\newtheorem{lemma}{Lemma}
\newtheorem{remark}{Remark}
\newtheorem{property}{Property}
\newtheorem{proposition}{Proposition}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Contrastive Learning for Supervised Graph Matching}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<gathika.ratnayaka@anu.edu.au>?Subject=Your UAI 2023 paper}{Gathika~Ratnayaka}{}}
\author[1]{{Qing~Wang}}
\author[2]{{Yang~Li}}

% Add affiliations after the authors
\affil[1]{%
    School of Computing\\
    Australian National University\\
    Canberra, Australia
}
\affil[2]{%
    School of Information Technology\\
    Deakin University\\
    Melbourne, Australia
}  
  \begin{document}
\maketitle

%%%%%%%%% ABSTRACT
\begin{abstract}
Deep graph matching techniques have shown promising results in recent years. In this work, we cast deep graph matching as a contrastive learning
task and introduce a new objective function for contrastive matching that exploits the relationships between matches and non-matches. To this end, we develop a hardness attention mechanism that selects negative samples by capturing the relatedness and informativeness of positive and negative samples. Further, we propose a novel deep graph matching framework, \emph{Stable Graph Matching} (StableGM), which incorporates Sinkhorn ranking into a stable marriage algorithm to efficiently compute one-to-one node correspondences between
graphs. We prove that the proposed objective function for contrastive matching is both positive-informative and negative-informative, offering theoretical guarantees for achieving
dual-optimality in graph matching. We empirically verify the effectiveness of the proposed approach through experiments on standard graph matching benchmarks.

\end{abstract}

%%%%%%%%% Introduction
\section{Introduction}
\label{sec:intro}

Given two graphs, graph matching aims to establish node correspondences between them. It is fundamental to many real-world applications, such as computer vision \citep{vento2013graph}, bioinformatics \citep{sharan2006modeling}, and social networks \citep{zhang2019graph}. However, due to its combinatorial nature, graph matching has long been known to be an NP-hard problem \citep{cordella2004sub}.

Early studies \citep{cho2010reweighted,zaslavskiy2008path} on graph matching primarily relied on handcrafting affinities of nodes and edges to leverage structural information from graphs. Graph matching was then formulated as a quadratic assignment problem (QAP) and solved by applying combinatorial optimization techniques~\citep{lawler1963quadratic,loiola2007survey}. As a result, traditional graph matching methods often lack flexibility and generalizability, failing to capture complex interactions in real-world applications. Recently, a number of deep graph matching methods have been proposed to address the limitations of traditional graph matching methods, and they have achieved state-of-the-art performance~\citep{wang2019learning,yu2019learning,rolinek2020deep,gao2021deep,wang2021neural}.


The main idea behind deep graph matching methods is to build an end-to-end learning model that extracts affinities from graphs via differentiable optimization in order to find ``soft'' node correspondences between graphs. This process often involves a spectral relaxation or a doubly-stochastic relaxation~\citep{zanfir2018deep,wang2020learning}. The learning objectives are usually formulated as a cross-entropy loss that maximizes affinities between pairs of nodes that are matched as per the ground truth while minimizing affinities between pairs of nodes that are not, e.g., permutation loss \citep{wang2019learning,yu2019learning} and binary cross-entropy loss \citep{wang2021neural}. However, these methods still suffer from undesired behaviors, such as class imbalance, overfitting, and numerical instability~\citep{yu2019learning,gao2021deep}. We observe that these issues are largely due to the inherent limitations of cross-entropy loss, which penalizes classification errors based on an element-wise comparison between a predicted matching matrix and a ground truth matrix, while ignoring the relationships among matches and non-matches.

On the other hand, contrastive loss has demonstrated superior performance in various self-supervised learning tasks 
\citep{wang2020understanding,wang2021understanding}. In a contrastive learning setting, given an anchor, there is one positive sample and multiple negative samples.
It has been shown that the quality of negative samples has a significant impact on learning \citep{xu2022negative}. Moreover, it has been shown that contrastive learning with explicit hard negative sampling, which penalizes hard negatives while ignoring easy (uninformative) negative samples during the training, often yields better results \citep{wang2021understanding}.

However, we observe that not only the informativeness of negative samples but also that of positive samples plays an important role in contrastive learning. Furthermore, for graph matching, the one-to-one mapping constraint is crucial: it implies that each node in a source graph has exactly one true match, while all other nodes are non-matches. Thus, unlike existing methods, which either remove this constraint in their relaxation or consider it only implicitly during learning, we explore the power of contrastive loss to
effectively incorporate the one-to-one mapping constraint into the learning objective through the choice of negative samples that are most informative with respect to each positive sample.

These observations motivate us to leverage contrastive learning in a supervised graph matching setting. In our work, a pair of nodes that are matched with each other as per the ground truth is considered a positive sample, and a pair of nodes that should not be matched with each other as per the ground truth is considered a negative sample. Within our framework, we propose an explicit hard negative sampling strategy called ``hardness attention'' to identify hard negative samples, i.e., negative samples that have a high chance of being falsely identified as positive samples, and propose a new contrastive loss for graph matching. We discuss the theoretical limitations that may occur when adapting the standard contrastive loss from the self-supervised literature \citep{wang2021understanding} to a supervised graph matching setting, and describe how the proposed contrastive loss function possesses theoretical properties that are suitable for graph matching. Moreover, we extend the stable marriage algorithm~\citep{mcvitie1970stable,gale1962college} to graph matching for computing the final one-to-one correspondence between two graphs.

\paragraph{Contributions}This work perceives supervised graph matching as a contrastive learning task. The main technical contributions of this work are summarized as follows.

\begin{itemize}
  \item We design a novel hard negative sampling strategy that is well suited for the supervised graph matching setting. This hard negative sampling strategy captures the relatedness and informativeness of positive and negative samples with respect to the corresponding true-matching node pairs.
  \item We propose a new contrastive loss function for supervised graph matching and theoretically analyze its properties. We show that the proposed contrastive loss function is both positive-informative and negative-informative.

  \item We develop a new deep graph matching architecture, called StableGM, which incorporates two key design choices: (1) The Sinkhorn algorithm is used to learn a preference ranking matrix in which each node is associated with a preference order of nodes for matching. (2) We extend the stable marriage algorithm to graph matching, such that it takes a Sinkhorn ranking matrix as input and generates a matching under the one-to-one mapping constraint. We prove that matchings generated by this algorithm are stable.
  \item Theoretically, we prove that StableGM guarantees the dual-optimality of matching solutions, i.e., nodes in both source and target graphs can achieve an optimal matching simultaneously based on Sinkhorn ranking. Empirically, we conduct experiments on real-world datasets to verify the effectiveness of the proposed graph matching framework.
\end{itemize}


\begin{figure*}
\centering
\includegraphics[width=0.8\textwidth]{gm_final-2-final-3.pdf}
\caption{Overview of the StableGM framework. Given two graphs, their node embeddings are obtained using a neural network (NN) and a cross-graph node-to-node affinity matrix is computed. Then, Sinkhorn normalization is performed to obtain a Sinkhorn ranking matrix. In the training phase, the Sinkhorn ranking matrix is compared with the ground truth matrix and hard negative sampling is performed via ``Hardness Attention'', e.g., a positive sample is highlighted in green while the corresponding negative samples are highlighted in red.
In the test phase, the Stable Matching algorithm takes the Sinkhorn ranking matrix as input and yields the node-to-node correspondence between the two graphs.}
\end{figure*}



%%%%%%%%% Problem

\section{Background}
\label{sec:problem}
We represent a graph as $\mathcal{G} = (V,E,\mathbf{A}, \mathbf{X})$, where $V=\{1,\dots,|V|\}$ is the set of vertices, $E$ is the set of edges, $\mathbf{A}\in \mathbb{R}^{|V|\times|V|}$ is the adjacency matrix, and $\mathbf{X}\in \mathbb{R}^{|V|\times d}$ is a node feature matrix with feature vectors of dimension $d$. Given two graphs, a source graph  $\mathcal{G}_S = (V_S, E_S, \mathbf{A}_S, \mathbf{X}_S)$ and a target graph $\mathcal{G}_T=(V_T, E_T, \mathbf{A}_T,\mathbf{X}_T)$, let $|V_S|=m$ and $|V_T|=n$, and w.l.o.g. $m \leq n$.

Generally speaking, the graph matching problem is to find a node correspondence matrix $\mathbf{{M}} \in \{0,1\}^{m \times n}$ between $\mathcal{G}_S$ and $\mathcal{G}_T$ which satisfies the one-to-one mapping constraint $\sum_{j=1}^{n} \mathbf{{M}}_{i,j} = 1 $ $\forall {i\in V_S}$ and $\sum_{i=1}^{m} \mathbf{{M}}_{i,j} \leq 1$ $\forall j\in V_T$ and is optimal w.r.t. an objective function $f(\cdot)$: 

\begin{equation}    \argmin_{\mathbf{{M}}}f(\mathbf{{M}};\mathcal{G}_S,\mathcal{G}_T).
\end{equation}


In the supervised learning setting, $f(\cdot)$ is usually defined 
 in terms of a loss function $L(\cdot)$. Given two graphs $\mathcal{G}_S$ and $\mathcal{G}_T$, and the ground truth node correspondence matrix $\mathbf{M}^*$, a graph matching model is learnt to minimize $L(\cdot)$:

\begin{equation}\label{eq:gm}
\argmin_{\mathbf{M}}
L(\mathbf{M}^*,\mathbf{M};\mathcal{G}_S,\mathcal{G}_T).
\end{equation}
To make the model differentiable, $\mathbf{M}$ is usually a ``soft'' matching matrix, such as a doubly stochastic matrix~\citep{fogel2013convex}.




\paragraph{Problem formulation}
Let $\mathbb{G}$ denote a set of graphs on a finite number of nodes and $\mathbb{M}$ be the set of $m\times n$ matching matrices, i.e., $\mathbb{M}=\{\mathbf{M}\in\{0,1\}^{m\times n}\mid\sum_{j=1}^{n} \mathbf{M}_{i,j} = 1$ $\forall i\in V_S$ and $\sum_{i=1}^{m} \mathbf{M}_{i,j} \leq 1$ $\forall j\in V_T\}$. In this work, our goal is to develop a deep graph matching model $f_{\Theta}:\mathbb{G}\times \mathbb{G}\rightarrow \mathbb{M}$ by end-to-end training through optimizing a contrastive matching objective.

%%%%%%%%% Methodology

\section{Stable Graph Matching}
\label{sec:method}

In this section we introduce a novel deep graph matching architecture, \emph{Stable Graph Matching} (StableGM). 

\paragraph{Overview of StableGM} Let $\mathbb{C}\subseteq\mathbb{R}^{m\times n}$ be a set of $m\!\times\! n$ cross-graph node-to-node affinity matrices, and let $\mathbb{S}$ be the set of $m\!\times\! n$ rectangular doubly stochastic matrices, i.e., $\mathbb{S}=\{\mathbf{S}\in[0,1]^{m\times n}\mid\mathbf{S}\mathbf{1}_n=\mathbf{1}_m, \mathbf{S}^{\intercal}\mathbf{1}_m\leq\mathbf{1}_n\}$.

StableGM consists of three main components. 
\begin{itemize}
\item[(1)] \emph{Graph affinity encoding}: a permutation-equivariant, differentiable function $\psi_{encode}:\mathbb{G}\times \mathbb{G}\rightarrow \mathbb{C}$ learns to encode features of two input graphs into a cross-graph node-to-node affinity matrix; 
\item[(2)] \emph{Sinkhorn ranking}: a permutation-invariant, differentiable function $\psi_{rank}: \mathbb{C}\rightarrow \mathbb{S}$ produces a preference ranking matrix for nodes in the two input graphs, i.e., for each node in one graph, ranking the preference orders in terms of nodes in the other graph; 
\item[(3)] \emph{Stable matching:} a matching algorithm $\psi_{match}: \mathbb{S}\rightarrow \mathbb{M}$ produces a matching matrix (node-to-node correspondences) between two input graphs under the one-to-one mapping constraint.
\end{itemize}



\subsection{Graph Affinity Encoding}
Graph affinity encoding involves two steps. First, we apply a neural network to encode geometric affinities between nodes, leveraging their features and structural information, into node embeddings. 
Then, a cross-graph node-to-node affinity layer takes the node embedding matrices of two graphs and constructs a matrix  representing affinities between nodes in $\mathcal{G}_S$ and $\mathcal{G}_T$.

Concretely, a neural network  $\textsc{nn}_{\theta}: \mathbb{G}\rightarrow \mathbb{R}^{|V|\times d}$, parameterized by $\theta$, takes a graph $\mathcal{G}\in \mathbb{G}$ as input and returns its node embeddings $\textsc{nn}_{\theta}(\mathcal{G})\in \mathbb{R}^{|V|\times d}$. Given two graphs $\mathcal{G}_S$ and $\mathcal{G}_T$, we apply $\textsc{nn}_{\theta}$ to obtain the node embedding matrices of $\mathcal{G}_S$ and $\mathcal{G}_T$, respectively:
\begin{equation}
\textsc{nn}_{\theta}(\mathcal{G}_S) = \mathbf{H}^S \in \mathbb{R}^{m\times d} \text{ ; }
\textsc{nn}_{\theta}(\mathcal{G}_T) = \mathbf{H}^T \in \mathbb{R}^{n\times d}.       
\end{equation}
Then, the cross-graph node-to-node affinity matrix $\mathbf{C}\in \mathbb{R}^{m\times n}$ between $\mathcal{G}_S$ and $\mathcal{G}_T$ is derived such that $\mathbf{C}_{i,j}$ denotes the affinity between the embedding of node $i$ in $\mathcal{G}_S$ and the embedding of node $j$ in $\mathcal{G}_T$:
\begin{equation}
\label{eq:weighted_inner_product}
  \mathbf{C} = f_{\gamma}(\mathbf{H}^S,\mathbf{H}^T)
\end{equation}  
Here, $f_{\gamma}$ denotes an affinity layer parameterized by $\gamma$ that encodes the cross-graph node-to-node affinities between $\mathcal{G}_S$ and $\mathcal{G}_T$. Different techniques, such as the inner product \citep{fey2020deep} and the weighted inner product \citep{rolinek2020deep}, can be used within an affinity layer to calculate the cross-graph node-to-node affinities.
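
As a minimal sketch of these two choices (the exact forms used in the cited works may differ, and the weight matrix $\mathbf{W}$ is notation introduced here only for illustration), an inner-product affinity layer has no parameters, whereas a weighted inner-product layer takes $\gamma=\mathbf{W}\in\mathbb{R}^{d\times d}$:
\begin{align*}
\mathbf{C} &= \mathbf{H}^S(\mathbf{H}^T)^{\intercal}, &
\mathbf{C} &= \mathbf{H}^S\mathbf{W}(\mathbf{H}^T)^{\intercal}.
\end{align*}
In both cases $\mathbf{C}\in\mathbb{R}^{m\times n}$, since $\mathbf{H}^S\in\mathbb{R}^{m\times d}$ and $\mathbf{H}^T\in\mathbb{R}^{n\times d}$.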


\subsection{Sinkhorn Ranking}
\label{subsec:Sinkhorn_Ranking}
Given a graph affinity matrix that captures node-to-node affinity scores between two graphs, we apply Sinkhorn normalization \citep{sinkhorn1964relationship}  to compute the ranking of nodes in terms of their preference orders for matching. 

Following \citet{wang2020learning},
a nonnegative matrix $\mathbf{S}^0$ is first obtained via $\mathbf{S}^0_{i,j}= \exp({\mathbf{C}_{i,j}}/{\alpha})$, where $\alpha>0$ is a parameter. In the case where the target graph is larger, i.e., $m < n$, the matrix $\mathbf{S}^{0}$ is padded into a square matrix with small positive values (e.g., $10^{-10}$). Then, the rows and columns of $\mathbf{S}^0$ are iteratively normalized into a rectangular doubly-stochastic matrix using an extended Sinkhorn algorithm~\citep{cour2007balanced,fey2020deep}. Specifically, let $\eta(\cdot)$ be the Sinkhorn operator, $\oslash$ be the element-wise (Hadamard) division, and $\mathbf{1_n}$ be a column vector of $n$ ones. Starting with $\eta^0(\hat{\mathbf{S}})=\mathbf{S}^0$, we have
\begin{equation}
\eta^i(\hat{\mathbf{S}})=f_{row}(f_{col}(\eta^{i-1}(\hat{\mathbf{S}}))),
\end{equation} where 
\begin{align}
  f_{row}(\hat{\mathbf{S}})&=\hat{\mathbf{S}} \oslash(\hat{\mathbf{S}}\mathbf{1_n}\mathbf{1_n}^\intercal),\\
  f_{col}(\hat{\mathbf{S}})&=\hat{\mathbf{S}} \oslash(\mathbf{1_n}\mathbf{1_n}^\intercal\hat{\mathbf{S}}).
\end{align}
 Padded elements are discarded from the final output to obtain a rectangular doubly stochastic matrix $ \mathbf{S} \in \mathbb{S}$. It should be noted that $\mathbf{S}_{i,j}$ can be viewed as representing the normalized affinity between $i \in V_S$ and $j \in V_T$. 
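
To illustrate a single iteration, consider a hypothetical square case with $m=n=2$ (so that no padding is needed) and
\[
\mathbf{S}^0=\begin{pmatrix}2&1\\1&1\end{pmatrix}.
\]
Column normalization divides each entry by its column sum ($3$ and $2$), and the subsequent row normalization divides each entry by the resulting row sum ($7/6$ and $5/6$):
\[
f_{col}(\mathbf{S}^0)=\begin{pmatrix}2/3&1/2\\1/3&1/2\end{pmatrix},\qquad
f_{row}(f_{col}(\mathbf{S}^0))=\begin{pmatrix}4/7&3/7\\2/5&3/5\end{pmatrix}.
\]
After this iteration every row sums to $1$, while the column sums ($34/35$ and $36/35$) are only approximately $1$; repeating the iteration drives them toward $1$.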
 
 Within our StableGM framework, any node $i \in V_S$ has a preference order, in which all the nodes in $V_T$ are ranked based on their normalized affinities. Specifically, given any two nodes $j,k$ in $V_T$, if $\mathbf{S}_{i,j} > \mathbf{S}_{i,k}$, then $j$ is ranked higher than $k$ in $i$'s preference order. If $\mathbf{S}_{i,j} = \mathbf{S}_{i,k}$, the tie is broken arbitrarily to obtain a strict ranking order (either $j$ is ranked higher than $k$ or $k$ is ranked higher than $j$). Any given node $j \in V_T$ also has a preference order over all the nodes in $V_S$. Given any two nodes $i,k$ in $V_S$, if $\mathbf{S}_{i,j} > \mathbf{S}_{k,j}$, then $i$ is ranked higher than $k$ in $j$'s preference order. If $\mathbf{S}_{i,j} = \mathbf{S}_{k,j}$, the tie is broken arbitrarily. As the preference orders of the nodes are derived from $\mathbf{S}$, we term $\mathbf{S}$ a Sinkhorn ranking.
 


\subsection{Stable Matching Algorithm}\label{subsec:stable-algorithm}


Based on the Sinkhorn ranking that yields preference orders of nodes for matching, we leverage known results on the stable marriage problem~\citep{gale1962college,mcvitie1970stable} to efficiently generate a graph matching. We begin by defining the notion of a blocking pair.





\begin{definition}[Blocking pair] Given a Sinkhorn ranking $\mathbf{S}$ and a matching $\mathbf{M}$, a node pair $(i,j)$ with $i\in V_S$ and $j\in V_T$  is a \emph{blocking pair} w.r.t. $\mathbf{M}$ if the following conditions are all satisfied:
\begin{itemize}
\item $\mathbf{M}_{i,j}=0$;
    \item $\exists k\in V_S\backslash\{i\}$ $(\mathbf{M}_{k,j}=1\wedge\mathbf{S}_{i,j}>\mathbf{S}_{k,j})$;
    \item $\exists k'\in V_T\backslash\{j\}$ $(\mathbf{M}_{i,k'}=1\wedge\mathbf{S}_{i,j}>\mathbf{S}_{i,k'})$ $\vee$ {$\forall k'\in V_T\backslash\{j\}$ $\mathbf{M}_{i,k'}=0$}.
\end{itemize}
\end{definition}
Intuitively, two nodes $i \in V_S$ and $j \in V_T$ that are not matched with each other form a blocking pair if $j$ ranks $i$ higher than the node that is currently matched with $j$, and $i$ ranks $j$ higher than the node (if any) that is currently matched with $i$. If there is at least one blocking pair w.r.t. $\mathbf{M}$, then $\mathbf{M}$ is unstable.

    


\begin{definition}[Stable matching]
   A matching $\mathbf{M}$ is \emph{stable} w.r.t. a Sinkhorn ranking $\mathbf{S}$ iff $\mathbf{M}$ has no blocking pair.
\end{definition}

Algorithm~\ref{alg:Stable_Marriage_Algorithm} describes our stable matching algorithm, which extends the Male Optimal Stable Marriage (MOSM) algorithm proposed by \citet{mcvitie1970stable} to the graph matching setting. Taking a Sinkhorn ranking $\mathbf{S}$ as input, $V_S$ and $V_T$ are assumed to correspond to the male and female parties in the stable marriage problem, respectively. Then, the stable matching algorithm returns a predicted matching matrix $\mathbf{M}\in \{0,1\}^{m\times n}$ that contains one-to-one node correspondences from $V_S$ to $V_T$. Here, $\mathbf{M}_{i,j}=1$ means that node $i \in V_S$ is predicted to match with node $j \in V_T$, and vice versa.

The following lemma and theorem can be obtained. The proofs are available in the supplementary material. 
 \begin{lemma}
   Given a Sinkhorn ranking $\mathbf{S}$, Algorithm~\ref{alg:Stable_Marriage_Algorithm} can always return a matching $\mathbf{M}$ that satisfies the one-to-one mapping constraint: $\sum_{j=1}^{n} \mathbf{M}_{i,j} = 1 $ $\forall i\in V_S$ and $\sum_{i=1}^{m} \mathbf{M}_{i,j} \leq 1$ $\forall {j\in V_T}$.
 \end{lemma} 

\begin{theorem}
  Let $\mathbf{M}$ be a matching returned by Algorithm~\ref{alg:Stable_Marriage_Algorithm} over a Sinkhorn ranking $\mathbf{S}$. Then $\mathbf{M}$ is stable w.r.t. $\mathbf{S}$.
\end{theorem}

\paragraph{Time complexity analysis} In the implementation, the preference order of a node $i \in V_S$ can be obtained by sorting (i.e., \textit{argsort}) the $i$-th row of $\mathbf{S}$. Similarly, the preference order of a node $j \in V_T$ can be obtained by performing an \textit{argsort} operation on the $j$-th column of $\mathbf{S}$. Thus, the time complexity of obtaining the preference orders of all nodes in $V_S$ and $V_T$ is $O(|V_S||V_T|\log|V_T|)$. After deriving the preference orders, the remaining steps in the stable matching algorithm are performed with a time complexity of $O(|V_S|^2)$. As a result, the time complexity of Algorithm~\ref{alg:Stable_Marriage_Algorithm} is $O(|V_S||V_T|\log|V_T|)$. In contrast, the Hungarian algorithm~\citep{kuhn1955hungarian}, which is commonly used to compute the final node correspondence between two graphs~\citep{wang2021neural,gao2021deep,yu2019learning,saad2021graph}, has a time complexity of $O(|V_T|^3)$.

It is worth noting that, in our stable matching algorithm, the column-wise and row-wise sorting operations can be parallelized when computing the preference orders of nodes. Therefore, in a parallelized setting, the time complexity of the stable matching algorithm can be further reduced.
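
The sorting cost can be traced explicitly. Assuming a comparison-based sort, ordering one row of $\mathbf{S}$ costs $O(|V_T|\log|V_T|)$ and ordering one column costs $O(|V_S|\log|V_S|)$, so that, with $|V_S|\leq|V_T|$,
\[
\underbrace{|V_S|\cdot O(|V_T|\log|V_T|)}_{\text{rows}}
+\underbrace{|V_T|\cdot O(|V_S|\log|V_S|)}_{\text{columns}}
= O(|V_S||V_T|\log|V_T|).
\]
For equally sized graphs with $|V_S|=|V_T|=n$, this gives $O(n^2\log n)$, compared with $O(n^3)$ for the Hungarian algorithm.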

\begin{algorithm}

\caption{Stable Matching algorithm}
\label{alg:Stable_Marriage_Algorithm}
\SetKwInput{KwInput}{Input}                % Set the Input
\SetKwInput{KwOutput}{Output}              % set the Output
\DontPrintSemicolon  
%  \KwInput{Preference rankings of each node in $V_S$ and $V_T$}
\KwInput{A Sinkhorn Ranking $\mathbf{S}$}
  \KwOutput{$\mathbf{M}$}
  $\mathbf{M} \leftarrow \mathbf{0}_{m\times n}$\;
  Obtain preference order for each node in $V_S$\;
  Obtain preference order for each node in $V_T$\;
  \While{$\exists i \in V_S$ that is not matched ($\mathbf{M}_{i,k}=0$ $\forall k\in V_T$)}{
    $j \leftarrow$ the highest-ranked node remaining in $i$'s preference order\;
    \If{$j$ is not matched ($\mathbf{M}_{k,j}=0$ $\forall k \in V_S$)}{
        $\mathbf{M}_{i,j} \leftarrow 1$\;
    }
    \Else{
        $i' \leftarrow$ the node that is currently matched with $j$\;
        \If{$i$ is ranked higher than $i'$ in the preference order of $j$}{
            $\mathbf{M}_{i',j} \leftarrow 0$\;
            $\mathbf{M}_{i,j} \leftarrow 1$\;
        }
    }
    Remove $j$ from the preference order of $i$\;
  }
  \Return{$\mathbf{M}$}\;

\end{algorithm}
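
As an illustration of Algorithm~\ref{alg:Stable_Marriage_Algorithm}, consider a hypothetical Sinkhorn ranking with $m=2$ and $n=3$, and suppose that unmatched nodes are processed in index order (an assumption made here only to fix the trace):
\[
\mathbf{S}=\begin{pmatrix}0.5&0.3&0.2\\0.45&0.2&0.35\end{pmatrix}.
\]
Node $1\in V_S$ first proposes to its top choice $j=1$, which is unmatched, so $\mathbf{M}_{1,1}\leftarrow 1$. Node $2\in V_S$ also ranks $j=1$ highest, but $j=1$ prefers its current match since $\mathbf{S}_{1,1}=0.5>\mathbf{S}_{2,1}=0.45$, so the proposal is rejected and $j=1$ is removed from the preference order of node $2$. Node $2$ then proposes to its next choice $j=3$, which is unmatched, so $\mathbf{M}_{2,3}\leftarrow 1$ and the algorithm terminates with
\[
\mathbf{M}=\begin{pmatrix}1&0&0\\0&0&1\end{pmatrix}.
\]
This matching has no blocking pair, whereas a row-wise $\argmax$ over $\mathbf{S}$ would assign both source nodes to $j=1$ and violate the one-to-one mapping constraint.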

%%%%%%%%% Objective

\section{Contrastive Matching}
\label{sec:objective}
In this section, we formulate a contrastive objective for graph matching. The key idea is to minimize a contrastive loss between positive samples (i.e., node pairs that should be matched) and negative samples (i.e., node pairs that should not be matched). However, despite the success of contrastive learning in many other domains, graph matching raises new challenges:
\begin{itemize}
    \item[(1)] Although it is natural to regard each true-matching pair as a positive sample and each non-matching pair as a negative sample, it is unclear how to capture the relationship between positive samples and negative samples in a contrastive loss.
    \item[(2)] It is important to devise a strategy for identifying hard negative samples, as contrastive learning with an explicit hard negative sampling strategy has often been shown to produce better results~\citep{wang2021understanding}.
\end{itemize}
We address these challenges in this section.

 

 
\subsection{Hardness Attention}\label{subsec:hardness}

We first design a hard negative sampling strategy to select hard negative samples according to their relatedness and informativeness to true matches. 


Let $\pi: V_S \rightarrow V_T$ denote an injective function that maps every node $i\in V_S$ to a node $\pi(i) \in V_T$ as per the ground truth, i.e., $\pi(i)$ is the true match of $i$. We ground our design of a hard negative sampling strategy on two observations: (1) \emph{Relatedness:} Randomly selecting negative samples is hardly effective for graph matching. This is because, for a true match $(i,\pi(i))$, the corresponding $\mathbf{S}_{i,\pi(i)}$ has little correlation with other entries $\mathbf{S}_{i',j'}$ if $i'\neq i$ and $j'\neq \pi(i)$. In contrast, the entries $\mathbf{S}_{i,j'}$ with $j'\neq \pi(i)$ and $\mathbf{S}_{i',\pi(i)}$ with $i'\neq i$ are closely correlated with $\mathbf{S}_{i,\pi(i)}$ due to the constraints $\mathbf{S}\mathbf{1}_n=\mathbf{1}_m$ and $\mathbf{S}^{\intercal}\mathbf{1}_m\leq\mathbf{1}_n$ imposed on rectangular doubly-stochastic matrices in $\mathbb{S}$. Thus, node pairs corresponding to $\mathbf{S}_{i,j'}$ with $j'\neq \pi(i)$ or $\mathbf{S}_{i',\pi(i)}$ with $i'\neq i$ may serve as negative samples of $\mathbf{S}_{i,\pi(i)}$.
(2) \emph{Informativeness:} Negative samples are often not equally informative. Specifically, among node pairs corresponding to $\mathbf{S}_{i,j'}$ with $j'\neq \pi(i)$ or $\mathbf{S}_{i',\pi(i)}$ with $i'\neq i$, some may be more informative than others with respect to $\mathbf{S}_{i,\pi(i)}$.



 For each node $i\in V_S$, we define a hardness attention matrix $\mathbf{Z}^{(i)} \in \{0,1\}^{m \times n}$ as

 \[
    \mathbf{Z}^{(i)}_{k,j}= 
\begin{cases}
    1 ,& \text{if }  k=i,j\neq \pi(i) \text{ and } \mathbf{S}_{k,j} \geq \mathbf{S}_{i,\pi(i)}-\beta \\
    1 ,& \text{if }  k\neq i,j=\pi(i) \text{  and  } \mathbf{S}_{k,j} \geq \mathbf{S}_{i,\pi(i)}-\beta \\
    0,              & \text{otherwise}
\end{cases}
\]

Here, $\beta \in [0,1]$ is a parameter that regulates the degree of hardness of negative samples to be selected. $\mathbf{Z}^{(i)}_{k,j}$ determines whether the $k$-th node in $\mathcal{G}_S$ and the $j$-th node in $\mathcal{G}_T$ should be selected as a hard negative sample with respect to the positive pair $(i,\pi(i))$ for training.
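
For instance, take a hypothetical Sinkhorn ranking with $m=2$, $n=3$, ground truth $\pi(1)=1$ and $\pi(2)=2$, and $\beta=0.35$:
\[
\mathbf{S}=\begin{pmatrix}0.6&0.3&0.1\\0.2&0.5&0.3\end{pmatrix}.
\]
For $i=1$, the threshold is $\mathbf{S}_{1,\pi(1)}-\beta=0.25$. Among the candidates in row $1$ and column $\pi(1)=1$, only $\mathbf{S}_{1,2}=0.3$ reaches it, whereas $\mathbf{S}_{1,3}=0.1$ and $\mathbf{S}_{2,1}=0.2$ do not. For $i=2$, the threshold is $0.15$ and all three candidates $\mathbf{S}_{2,1}=0.2$, $\mathbf{S}_{2,3}=0.3$, and $\mathbf{S}_{1,2}=0.3$ reach it. Hence
\[
\mathbf{Z}^{(1)}=\begin{pmatrix}0&1&0\\0&0&0\end{pmatrix},\qquad
\mathbf{Z}^{(2)}=\begin{pmatrix}0&1&0\\1&0&1\end{pmatrix}.
\]
A smaller $\beta$ selects fewer, harder negatives; with $\beta=0$, no negative sample would be selected for $i=1$ in this example.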




\subsection{Contrastive Matching Loss}\label{subsec:loss}
Given a Sinkhorn ranking $\mathbf{S}$ and a set of hardness attention matrices $\mathbf{Z}=\{\mathbf{Z}^{(i)}\}_{i\in V_S}$, we formulate a contrastive matching loss as: 
\begin{align}\label{eq:contrastive-loss-gm} 
{L} =\frac{1}{m} \sum_{i=1}^{m} - \frac{1}{2}\ln \left(\frac{f^{+}(i)}{1+f^{-}(i)}\right) 
\end{align}
where
\begin{align*} 
   f^{+}(i) &=\mathbf{S}^{2}_{i,\pi(i)}; \\
   f^{-}(i) &= \sum_{k=1}^{m}\sum_{j=1}^{n}\mathbf{Z}_{k,j}^{(i)}\mathbf{S}^{2}_{k,j}.
\end{align*} 
In these equations, $f^{+}(i)$ is the affinity of a positive sample associated with $i\in V_S$ and $f^{-}(i)$ is the sum of the affinities of corresponding hard negative samples, determined by the hardness attentions in $\mathbf{Z}^{(i)}$. 
For clarity, we use $L_i$ to denote the contrastive loss for each $i\in V_S$: 
\begin{equation}\label{eq:contrastive-loss-gm-single} 
L_{i}=-\frac{1}{2}\ln \left(\frac{f^{+}(i)}{1+f^{-}(i)}\right)
\end{equation}
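As a numerical illustration with hypothetical values, suppose that $\mathbf{S}_{i,\pi(i)}=0.6$ and that a single hard negative with affinity $0.3$ in row $i$ is selected, so that $f^{+}(i)=0.36$ and $f^{-}(i)=0.09$:
\[
L_i=-\frac{1}{2}\ln\left(\frac{0.36}{1.09}\right)\approx 0.554.
\]
If a second hard negative with affinity $0.4$ in column $\pi(i)$ is also selected, then $f^{-}(i)=0.09+0.16=0.25$ and $L_i=-\frac{1}{2}\ln(0.36/1.25)\approx 0.622$, i.e., the loss grows with the total affinity of the selected negatives.
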
This contrastive loss exhibits the following properties:
\begin{itemize}
\item For two nodes $\{i,i'\}\subseteq V_S$, if the sums of the affinities of their corresponding negative samples are the same, i.e., $f^-(i)=f^-(i')$, their losses are negatively correlated with the affinities of their positive samples.


\begin{property}[Informativeness of positive samples]
\label{property-positive1} Let $\{i,i'\}\subseteq V_S$ with $f^-(i)=f^-(i')$. If $f^+(i)>f^+(i')$, then $L_i<L_{i'}$.
\end{property}


\item For two nodes $\{i,i'\}\subseteq V_S$, if the affinities of their positive samples are the same, i.e., $f^+(i)=f^+(i')$, their losses are positively correlated with the sums of the affinities of their negative samples.



 \begin{property}[Informativeness of negative samples]\label{property-negative} Let $\{i,i'\}\subseteq V_S$ with $f^+(i)=f^+(i')$. If $f^-(i)>f^-(i')$, then $L_i>L_{i'}$.
 \end{property}
\end{itemize}

A loss function $L(\cdot)$ is said to be \emph{positive-informative} if it satisfies Property~\ref{property-positive1} and \emph{negative-informative} if it satisfies Property~\ref{property-negative}. The following theorem holds; its proof is included in the supplementary material.
\begin{theorem}
The contrastive loss of Eq.~\ref{eq:contrastive-loss-gm-single} is both positive-informative and negative-informative.
\end{theorem}
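
As a brief sanity check of this result (a sketch only, assuming $f^{+}(i)>0$, i.e., $\mathbf{S}_{i,\pi(i)}>0$, and nonnegative hardness attentions so that $f^{-}(i)\geq 0$; the full proof is in the supplementary material), Eq.~\ref{eq:contrastive-loss-gm-single} can be split into a positive and a negative term:
\begin{align*}
L_i &= -\tfrac{1}{2}\ln f^{+}(i) + \tfrac{1}{2}\ln\left(1+f^{-}(i)\right),\\
\pdv{L_i}{f^{+}(i)} &= -\frac{1}{2 f^{+}(i)} < 0, \qquad
\pdv{L_i}{f^{-}(i)} = \frac{1}{2\left(1+f^{-}(i)\right)} > 0.
\end{align*}
Hence, with $f^{-}(i)$ held fixed, $L_i$ is strictly decreasing in $f^{+}(i)$ (Property~\ref{property-positive1}), and with $f^{+}(i)$ held fixed, $L_i$ is strictly increasing in $f^{-}(i)$ (Property~\ref{property-negative}).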

It should be noted that, given a positive sample $(i,\pi(i))$ with $i \in V_S$, the sum of affinities of the related negative samples $f^-(i)$ can reach 0, namely when $\mathbf{Z}_{k,j}^{(i)}=0$ for all $k \in V_S$ and all $j \in V_T$. In such a situation, the standard InfoNCE contrastive loss \citep{wang2021understanding} is not positive-informative, which is undesirable for graph matching.
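
To make this concrete, consider the following sketch in the notation above. We assume, for illustration only, that an InfoNCE-style loss takes the form $L^{\mathrm{NCE}}_i$ below (temperature omitted; the symbol $L^{\mathrm{NCE}}_i$ and this way of writing it in terms of $f^{+}$ and $f^{-}$ are our notation, not quoted from \citep{wang2021understanding}):
\begin{align*}
L^{\mathrm{NCE}}_i &= -\ln\left(\frac{f^{+}(i)}{f^{+}(i)+f^{-}(i)}\right), &
L^{\mathrm{NCE}}_i\Big|_{f^{-}(i)=0} &= -\ln 1 = 0,\\
L_i &= -\frac{1}{2}\ln\left(\frac{f^{+}(i)}{1+f^{-}(i)}\right), &
L_i\Big|_{f^{-}(i)=0} &= -\tfrac{1}{2}\ln f^{+}(i) = -\ln \mathbf{S}_{i,\pi(i)}.
\end{align*}
Under this assumption, $L^{\mathrm{NCE}}_i=0$ for every $f^{+}(i)>0$ once $f^{-}(i)=0$, so two nodes with different positive affinities receive the same loss, whereas $L_i$ still decreases strictly as $\mathbf{S}_{i,\pi(i)}$ grows.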



\subsection{Gradient Analysis}
\label{subsec:gradient_analysis}



We first analyze the gradients of our contrastive loss function defined in Eq.~\ref{eq:contrastive-loss-gm-single}.

Given any $i\in V_S$, the gradient of the contrastive matching loss with respect to its positive sample is
\begin{equation}\label{eq:positive gradient}
\pdv[]{L_{i}}{\mathbf{S}_{i,\pi(i)}}=-\frac{1}{\mathbf{S}_{i,\pi(i)}}.
\end{equation}
Accordingly, for a negative sample of $i\in V_S$, the gradient of the contrastive matching loss with respect to $\mathbf{S}_{i,q}$, where $q \neq \pi(i)$, is
\begin{equation}\label{eq:negative gradient}
\pdv[]{L_{i}}{\mathbf{S}_{i,q}}=\frac{\mathbf{Z}^{(i)}_{i,q}\mathbf{S}_{i,q}}{1+\sum_{k=1}^{m}\sum_{j=1}^{n}\mathbf{Z}_{k,j}^{(i)}\mathbf{S}^{2}_{k,j}}.
\end{equation}
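Both expressions follow from the chain rule applied to $L_i=-\tfrac{1}{2}\ln f^{+}(i)+\tfrac{1}{2}\ln\left(1+f^{-}(i)\right)$. The short derivation below is a sketch under two assumptions that reflect our reading of the definitions above: $\mathbf{S}^{2}_{k,j}$ denotes the squared entry $(\mathbf{S}_{k,j})^{2}$, and the positive entry is not among its own negatives, i.e., $\mathbf{Z}^{(i)}_{i,\pi(i)}=0$:
\begin{align*}
\pdv{L_i}{\mathbf{S}_{i,\pi(i)}} &= -\frac{1}{2}\cdot\frac{1}{\mathbf{S}^{2}_{i,\pi(i)}}\cdot 2\,\mathbf{S}_{i,\pi(i)} = -\frac{1}{\mathbf{S}_{i,\pi(i)}},\\
\pdv{L_i}{\mathbf{S}_{i,q}} &= \frac{1}{2}\cdot\frac{1}{1+f^{-}(i)}\cdot 2\,\mathbf{Z}^{(i)}_{i,q}\mathbf{S}_{i,q} = \frac{\mathbf{Z}^{(i)}_{i,q}\mathbf{S}_{i,q}}{1+f^{-}(i)}.
\end{align*}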
Let $\{i, i'\}\subseteq V_S$ and $\{q, q'\}\subseteq V_T$. From Eq.~\ref{eq:positive gradient} and Eq.~\ref{eq:negative gradient}, we have the following results:

\begin{itemize}
    \item If $\mathbf{S}_{i, \pi(i)} < \mathbf{S}_{i', \pi(i')}$, then 
    \begin{equation}
    \label{eq: affinity_aware_gradient}
    \left|\pdv[]{L_{i}}{\mathbf{S}_{i,\pi(i)}}\right| > \left|\pdv[]{L_{i'}}{\mathbf{S}_{i',\pi(i')}}\right|.
    \end{equation}
    \item If $\mathbf{Z}^{(i)}_{i,q}\mathbf{S}_{i,q} > \mathbf{Z}^{(i)}_{i,q'}\mathbf{S}_{i,q'}$, then 
    \begin{equation}
    \label{eq:hardness_aware_gradient}
    \left|\pdv[]{L_{i}}{\mathbf{S}_{i,q}}\right| > \left|\pdv[]{L_{i}}{\mathbf{S}_{i,q'}}\right|.
    \end{equation}

\end{itemize}

Eq.~\ref{eq:positive gradient} and Eq.~\ref{eq: affinity_aware_gradient} show that a higher penalty is given to a positive sample with a lower affinity.
Further, Eq.~\ref{eq:negative gradient} and Eq.~\ref{eq:hardness_aware_gradient} show that the gradient w.r.t. a negative sample is hardness-aware: hard negatives with higher affinities are penalized more than negative samples with lower affinities.  
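
As a purely illustrative numeric instance (the values are hypothetical and not taken from any experiment), let $\mathbf{S}_{i,\pi(i)}=0.2$ and $\mathbf{S}_{i',\pi(i')}=0.8$. Eq.~\ref{eq:positive gradient} then gives
\begin{equation*}
\left|\pdv{L_{i}}{\mathbf{S}_{i,\pi(i)}}\right| = \frac{1}{0.2} = 5, \qquad
\left|\pdv{L_{i'}}{\mathbf{S}_{i',\pi(i')}}\right| = \frac{1}{0.8} = 1.25,
\end{equation*}
so the poorly matched positive sample receives a gradient four times larger in magnitude than the well matched one.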


\begin{remark}
From the perspective of gradient analysis, the standard contrastive loss function defined in \citep{wang2021understanding} exhibits an undesirable behavior in the graph matching setting. Specifically,
because the magnitude of its gradient with respect to a positive sample equals the sum of the gradients with respect to all the negative samples considered, no gradient is propagated to the positive sample once the total gradient from the negative samples reaches 0, even though the affinity of the positive sample could still be increased.  
\end{remark}
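
For reference, this behavior can be read off the softmax form of the InfoNCE loss. The sketch below uses our own notation, which is an assumption rather than a quotation of \citep{wang2021understanding}: $s_{i,k}$ denotes the similarity between anchor $i$ and sample $k$, $s_{i,i}$ the similarity of the positive pair, and $\tau>0$ a temperature.
\begin{align*}
L^{\mathrm{NCE}}_i &= -\ln\frac{\exp(s_{i,i}/\tau)}{\sum_{l}\exp(s_{i,l}/\tau)}, &
P_{i,k} &= \frac{\exp(s_{i,k}/\tau)}{\sum_{l}\exp(s_{i,l}/\tau)},\\
\pdv{L^{\mathrm{NCE}}_i}{s_{i,i}} &= -\frac{1}{\tau}\sum_{k\neq i}P_{i,k}, &
\pdv{L^{\mathrm{NCE}}_i}{s_{i,k}} &= \frac{1}{\tau}P_{i,k} \quad (k\neq i).
\end{align*}
When $\sum_{k\neq i}P_{i,k}$ approaches 0, the gradient with respect to the positive pair vanishes as well, irrespective of the value of $s_{i,i}$. In contrast, the gradient in Eq.~\ref{eq:positive gradient} depends only on $\mathbf{S}_{i,\pi(i)}$.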




\section{Further Discussion}\label{sec:discussion}

Below, we discuss the optimality of matching achieved by our proposed method StableGM.
\begin{definition}[Optimal matching]
Let $\mathbf{S}$ be a Sinkhorn ranking. A matching $\mathbf{M}$ is \emph{$\mathcal{G}_S$-optimal} w.r.t. $\mathbf{S}$ if $\mathbf{M}_{i,j}=1$ implies $\mathbf{S}_{i,j}>\mathbf{S}_{i,k}$ for all $k\in V_T\setminus\{j\}$. Similarly, a matching $\mathbf{M}$ is \emph{$\mathcal{G}_T$-optimal} w.r.t. $\mathbf{S}$ if $\mathbf{M}_{i,j}=1$ implies $\mathbf{S}_{i,j}>\mathbf{S}_{k,j}$ for all $k\in V_S\setminus\{i\}$. 
\end{definition}

\begin{definition}[Dual-optimal matching]
A matching $\mathbf{M}$ is \emph{dual-optimal} w.r.t. $\mathbf{S}$ if and only if $\mathbf{M}$ is both $\mathcal{G}_S$-optimal and $\mathcal{G}_T$-optimal w.r.t. $\mathbf{S}$.
\end{definition}




\begin{definition}[R1-symmetry]\label{def:R1-S} A Sinkhorn ranking $\mathbf{S}$ is \emph{R1-symmetric} if the following two conditions are both satisfied:

\begin{itemize}
    \item for any $i \in V_S$ there exists some $j \in V_T$ such that $\mathbf{S}_{i,j}>\mathbf{S}_{i,k}$ holds for all $ k\in V_T\backslash\{j\}$;
    \item if $\mathbf{S}_{i,j}>\mathbf{S}_{i,k}$ holds for all $k\in V_T\backslash\{j\}$, then $\mathbf{S}_{i,j}>\mathbf{S}_{k,j}$ also holds for all $k\in V_S\backslash\{i\}$.   
\end{itemize}


\end{definition}
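
As an illustration of Definition~\ref{def:R1-S}, consider two hypothetical $3\times 3$ matrices (the values are made up for this example and chosen to be doubly stochastic, with rows indexed by $V_S$ and columns by $V_T$):
\begin{equation*}
\mathbf{S}^{(a)}=\begin{pmatrix}0.7&0.2&0.1\\0.2&0.7&0.1\\0.1&0.1&0.8\end{pmatrix},\qquad
\mathbf{S}^{(b)}=\begin{pmatrix}0.5&0.3&0.2\\0.5&0.1&0.4\\0.0&0.6&0.4\end{pmatrix}.
\end{equation*}
In $\mathbf{S}^{(a)}$, each row has a unique maximum (on the diagonal) and each of these entries is also the strict maximum of its column, so $\mathbf{S}^{(a)}$ is R1-symmetric. In $\mathbf{S}^{(b)}$, every row also has a unique maximum, but rows 1 and 2 both attain it in the first column with $\mathbf{S}^{(b)}_{1,1}=\mathbf{S}^{(b)}_{2,1}=0.5$, so the second condition fails and $\mathbf{S}^{(b)}$ is not R1-symmetric.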



Based on the definition of R1-symmetry of $\mathbf{S}$, we can prove the following theorem and proposition. The proof details are included in the supplementary material.

\begin{theorem}
Let $\mathbf{M}$ be a matching produced by StableGM. Then $\mathbf{M}$ is dual-optimal if $\mathbf{S}$ is R1-symmetric.
\end{theorem}


\begin{proposition}\label{pro:R1}
      Let $\mathbf{M}$ be a matching produced by StableGM. Then $\mathbf{M}$ is dual-optimal when the  contrastive loss function $L(\cdot)$ is {minimized}. 
\end{proposition}

It should be noted that, in its original setting, the MOSM algorithm for stable marriage assignment \citep{mcvitie1970stable} can only guarantee the optimality of solutions for one side, which is well known as the male optimality of stable marriage solutions~\citep{gusfield1989stable}. This is undesirable for the graph matching problem, since the optimality of a graph matching should be \emph{dual}, i.e., hold with respect to both the source and the target graph. Nonetheless, as shown in Proposition~\ref{pro:R1},
in a contrastive learning setting with a properly designed loss function, a model can be learned to achieve dual-optimality of graph matching using the stable matching algorithm.  


%%%%%%%%% Experiments

\begin{table*}[th!]
    \centering
\resizebox{\textwidth}{!}%
{
\begin{tabular}{l| c c c c c c c c c c c c c c c c c c c c |c} 
 \toprule
 Method & aero &	bike & bird	& boat	& bottle & bus & car & cat & chair & cow & table & dog & horse & mbike & person & plant & sheep	& sofa	& train	& tv & mean \\ 
 \midrule

GMN  & 41.6 & 59.6 & 60.3 & 48.0 & 79.2 & 70.2 & 67.4 & 64.9 & 39.2 & 61.3 & 66.9 & 59.8 & 61.1 & 59.8 & 37.2 & 78.2 & 68.0 & 49.9 & 84.2 & 91.4 & 62.4 \\

PCA-GM  & 49.8 & 61.9 & 65.3 & 57.2 & 78.8 & 75.6 & 64.7 & 69.7 & 41.6 & 63.4 & 50.7 & 67.1 & 66.7 & 61.6 & 44.5 & 81.2 & 67.8 & 59.2 & 78.5 & 90.4 & 64.8 \\

IPCA-GM  & 53.8 & 66.2 & 67.1 & 61.2 & 80.4 & 75.3 & 72.6 & 72.5 & 44.6 & 65.2 & 54.3 & 67.2 & 67.9 & 64.2 & 47.9 & 84.4 & 70.8 & 64.0 & 83.8 & 90.8 & 67.7 \\

CIE-H & 52.5	& 68.6	& 70.2	& 57.1	& 82.1	& 77	& 70.7 & 	73.1	& 43.8	& 69.9 &	62.4	& 70.2	& 70.3	& 66.4	& 47.6	& 85.3	& 71.7	& 64	& 83.9	& 91.7	& 68.9 \\

qc-DGM2  & 49.6 & 64.6 & 67.1 & 62.4 & 82.1 & 79.9 & 74.8 & 73.5 & 43.0 & 68.4 & 66.5 & 67.2 & 71.4 & 70.1 & 48.6 & 92.4 & 69.2 & 70.9 & 90.9 & 92.0 & 70.3 \\
 
 BBGM  & 61.9 & 71.1 & \underline{79.7} & 79.0 & 87.4 & 94.0 & 89.5 & 80.2 & 56.8 & 79.1 & 64.6 & 78.9 & 76.2 & 75.1 & 65.2 & 98.2 & 77.3 & 77.0 & 94.9 & \textbf{93.9} & 79.0 \\ 
 
 NGM-v2  & 61.8 & 71.2 & 77.6 & 78.8 & 87.3 & 93.6 & 87.7 & 79.8 & 55.4 & 77.8 & \underline{89.5} & 78.8 & 80.1 & \textbf{79.2} & 62.6 & 97.7 & 77.7 & 75.7 & 96.7 & 93.2 & 80.1 \\
 
 NHGM-v2  & 59.9 & 71.5 & 77.2 & 79.0 & 87.7 & 94.6 & 89.0 & \textbf{81.8} & 60.0 & \underline{81.3} & 87.0 & 78.1 & 76.5 & 77.5 & 64.4 & \textbf{98.7} & \underline{77.8} & 75.4 & 97.9 & 92.8 & 80.4 \\


 
GCAN  & 63.4 & 71.2 & \textbf{80.1} & \textbf{81.1} & \textbf{90.4} & \textbf{95.5} & \underline{89.5} & 80.4 & \textbf{65.3} & 80.8 & \textbf{89.9} & \underline{81.4} & \underline{80.6} & 78.1 & \textbf{67.7} & 98.2 & 77.5 & \textbf{82.6} & \underline{98.4} & 93.4 & \textbf{82.3} \\
 \midrule
  StableGM-v1 & \underline{63.57}	& \textbf{72.85}	& 79.66	& \underline{81.02} &	\underline{88.73} &	\underline{94.71} & 88.81 & 78.41 &	59.96	& 79.14 &	84.33 &	80.48 &	78.43 &	\underline{78.45} &	63.90	& 97.90	& 77.77	& \underline{78.25}	& 98.00 & 93.24 & 80.88 \\ 
 StableGM-v2 & \textbf{65.07}	& \underline{72.51}	& 79.36	& 79.28 &	88.22 &	94.42 & \textbf{89.98} & \underline{81.3} &	\underline{65.05}	& \textbf{81.34} &	82.38 &	\textbf{82.25} &	\textbf{81.24} &	77.53 &	\underline{66.11}	& \underline{98.36}	& \textbf{79.44}	& 74.33	& \textbf{98.47} & \underline{93.71} & \underline{81.52} \\ 
\bottomrule
\end{tabular}%
}
\caption{Matching accuracy (\%) results on Pascal VOC Keypoint. Best results are in bold. Second best results are underlined.}
\label{tab:pascalVOC}
\end{table*}

\begin{table*}[htbp]
    \centering
     \resizebox{\textwidth}{!}
    {
    \begin{tabular}{l |c c c c c c c c c c c c c c c c c c| c } 
    \toprule
 Method  & aero &	bike & bird	& boat	& bottle & bus & car & cat & chair & cow  & dog & horse & mbike & person & plant & sheep & train & tv & mean \\ 
    \midrule
GMN  & 59.9	& 51 & 74.3	& 46.7	& 63.3	& 75.5	& 69.5	& 64.6	& 57.5	& 73	& 58.7	& 59.1	& 63.2	& 51.2	& 86.9	& 57.9	& 70 & 92.4	& 65.3 \\
PCA-GM  & 64.7	& 45.7	& 78.1	& 51.3	& 63.8	& 72.7	& 61.2	& 62.8	& 62.6	& 68.2	& 59.1	& 61.2	& 64.9	& 57.7	& 87.4	& 60.4	& 72.5	& 92.8	& 66 \\
IPCA-GM  & 69 & 52.9	& 80.4	& 54.3	& 66.5	& 80 & 68.5	& 71.4	& 61.4 & 74.8	& 66.3	& 65.1	& 69.6	& 63.9	& 91.1	& 65.4	& 82.9	& 97.5	& 71.2 \\
CIE-H & 71.5	& 57.1	& 81.7	& 56.7	& 67.9	& 82.5	& 73.4	& 74.5 & 62.6	& 78 & 68.7	& 66.3	& 73.7	& 66 & 92.5	& 67.2	& 82.3	& 97.5	& 73.3 \\
BBGM  & 72.50	& 64.55	& 87.80	& 75.81	& 69.27	& 93.95	& 88.59	& 79.92	& \textbf{74.56}	& 83.15	& 78.78	& \textbf{77.10}	& 76.50	& 76.34 & 98.20 & 85.54 & 96.78 & 99.31 & 82.15\\ 
NGM-v2  & 68.8	& 63.3	& 86.8	& 70.1	& 69.7	& 94.7	& 87.4	& 77.4	& 72.1 & 80.7	& 74.3	& 72.5	& 79.5	& 73.4	& 98.9	& 81.2	& 94.3	& 98.7	& 80.2 \\
NHGM-v2  & 62 & 57.8 & 86.4	& 68.5	& 68.7	& 93.4	& 80.8	& 76.6	& 69.2	& 79.9	& 66.2	& 71.7	& 78.1	& 69.5	& 98.2	& 84.4	& 93.2	& 99.3	& 78 \\  
GCAN  & 69.56	& 61.88	& \textbf{89.85}	& 75.21	& \textbf{70.41}	& \underline{97.22}	& 87.55	& \underline{80.39}	& 70.37	& 83.77	& 75.48	& 72.26	& \textbf{81.17}	& 75.52 & \textbf{99.71} & 86.07 & \textbf{97.75} & \textbf{99.94} & 81.90\\ 
    \midrule
StableGM-v1 & \underline{74.20} & \underline{65.78} & 88.05 & \underline{76.46} & 69.60 & \textbf{97.70}	& \underline{90.18}	& \textbf{80.99}	& 71.53 & \textbf{85.18}	& \underline{78.13}	& \underline{74.61}	& 79.58	& \textbf{79.13}	& \underline{99.7}	& \textbf{87.09}	& \underline{97.53}	& 99.47	& \underline{83.05} \\
StableGM-v2 & \textbf{75.08} & \textbf{68.65} & \underline{89.50} & \textbf{76.86} & \underline{70.14} & 97.16	& \textbf{91.74}	& 79.15	& \underline{72.58} & \underline{84.56}	& \textbf{81.58}	& 74.06	& \underline{80.85}	& \underline{78.93}	& 99.04	& \underline{86.81}	& 96.10	& \underline{99.76}	& \textbf{83.48} \\
    \bottomrule
    \end{tabular}
    }
\caption{Matching accuracy (\%) results on SPair-71k. Best results are in bold. Second best results are underlined.}
\label{tab:spair71K}
\end{table*}


\begin{table}[th!]
    \centering
    \resizebox{0.4\textwidth}{!}{
 
    \begin{tabular}{l| c c c c c |c } 
 \toprule
 Method & car & duck & face & mbike & w-bottle & mean \\  
 \midrule
 GMN  & 67.9 & 76.7 & 99.8 & 69.2 & 83.1 & 79.3\\
 PCA-GM  & 87.6 & 83.6 & \textbf{100.0} & 77.6 & 88.4 & 87.4 \\
 IPCA-GM  & 90.4 & 88.6 & \textbf{100.0} & 83.0 & 88.3 & 90.1\\
 qc-DGM2 & \textbf{100.0} & \textbf{98.8} & 98.0 & 92.8 & 99.0 & 97.7 \\
 BBGM  & 96.8 & 89.9 & \textbf{100.0} & 99.8 & \underline{99.4} & 97.2 \\
 NGM-v2 & 97.4 & 93.4 & \textbf{100.0} & 98.6 & 98.3 & 97.5 \\
 NHGM-v2 & 97.4 & 93.9 & \textbf{100.0} & 98.6 & 98.9 & 97.8 \\
 GCAN & 98.8 & 94.1 & \textbf{100.0} & \textbf{100.0} & \textbf{100.0} & \textbf{98.6} \\
\midrule
StableGM-v1 & \underline{98.85} & 94.23 & \textbf{100.0} & \textbf{100.0} & 99.23 & 98.46 \\
StableGM-v2 & \underline{98.85} & \underline{94.62} & \textbf{100.0} & 99.81 & 99.31 & \underline{98.52} \\
\bottomrule
\end{tabular}}
    \caption{Matching accuracy (\%) results on Willow ObjectClass. Best results are in bold. Second best results are underlined.}
    \label{tab:Willow}
\end{table}



\section{Experiments}\label{sec:experiment}




\subsection{Image Keypoints Matching}

\paragraph{Datasets}We conduct experiments on three image keypoint matching benchmarks widely used for learning-based graph matching: 
\textbf{1) Pascal VOC Keypoint with Berkeley annotations} \citep{everingham2010pascal}, which consists of images from 20 classes with annotated keypoints; 
\textbf{2) SPair-71k} \citep{min2019spair}, which contains 70,958 image pairs from Pascal VOC 2012 and Pascal 3D+ belonging to 18 classes; and 
\textbf{3) Willow ObjectClass} \citep{cho2013learning}, which contains images from 5 classes, where all images of a class have the same number of annotated keypoints. 


\paragraph{Baselines} We considered the following baselines: GMN \citep{zanfir2018deep}, PCA-GM \citep{wang2019learning}, IPCA-GM \citep{wang2020combinatorial}, CIE-H \citep{yu2019learning}, qc-DGM2 \citep{gao2021deep}, BBGM \citep{rolinek2020deep}, NGM-v2 \citep{wang2020learning}, NHGM-v2 \citep{wang2020learning}, and GCAN \citep{jiang2022graph}. 


\paragraph{Experimental setup}

For the Pascal VOC Keypoint, Willow ObjectClass, and SPair-71k datasets, we followed the experimental setup in \citep{wang2021neural}\footnote{\url{https://github.com/Thinklab-SJTU/ThinkMatch}}. It is worth noting that GCAN \citep{jiang2022graph} followed the same experimental protocols as \citep{wang2021neural} for the Pascal VOC Keypoint and Willow ObjectClass datasets. For the SPair-71k dataset, however, GCAN \citep{jiang2022graph} followed the experimental setup of \citep{rolinek2020deep}, which filters out keypoints that lie outside the bounding box of an image \citep{wang2021neural}.
In contrast, we retained such keypoints on SPair-71k by following the experimental setup in \citep{wang2021neural}.



To extract visual features of annotated keypoints, we followed the feature extraction method suggested in \citep{wang2021neural,jiang2022graph}, where the feature vector of each keypoint is obtained by concatenating the outputs of the \texttt{relu4\_2} and \texttt{relu5\_1} operations of a pre-trained VGG16 model \citep{simonyan2014very}. Then, for each input image pair, the graph of one annotated image was created by performing Delaunay triangulation \citep{delaunay1934sphere} on its keypoints, while the graph of the other annotated image was fully connected over its keypoints, following the same protocol as in \citep{wang2021neural}. In evaluations, matching accuracy is computed as the percentage of correct matchings among all true matchings \citep{wang2021neural}.
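
Written as a formula (a sketch in our own notation, where $\mathbf{M}^{*}\in\{0,1\}^{m\times n}$ is a hypothetical symbol for the ground-truth matching and $\mathbf{M}\in\{0,1\}^{m\times n}$ is the predicted matching), this metric reads
\begin{equation*}
\text{accuracy} = 100 \times \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}\mathbf{M}_{i,j}\,\mathbf{M}^{*}_{i,j}}{\sum_{i=1}^{m}\sum_{j=1}^{n}\mathbf{M}^{*}_{i,j}}.
\end{equation*}
For instance, if 8 out of 10 ground-truth correspondences are recovered, the matching accuracy is 80\%.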

\paragraph{StableGM models} We evaluate two variants of StableGM: StableGM-v1 and StableGM-v2. Both variants are trained using the StableGM framework described in Sections~\ref{sec:method} and~\ref{sec:objective}. They differ only in the techniques used to learn node embeddings and to derive cross-graph node-to-node affinities. 

In StableGM-v1, the neural architecture proposed in BBGM \citep{rolinek2020deep} is used as the neural network $\textsc{nn}_{\theta}$ to learn node embeddings, where node embeddings of the keypoints are obtained using a two-layer SplineCNN \citep{fey2018splinecnn}. Moreover, in StableGM-v1, we use the affinity layer proposed in BBGM \citep{rolinek2020deep} to derive the cross-graph node-to-node affinity matrix $\mathbf{C}$, where node affinities are obtained using a weighted inner product. 

In StableGM-v2, we adapt the Graph Context Attention (GCA) mechanism proposed in GCAN \citep{jiang2022graph} to learn node embeddings and to obtain the cross-graph node-to-node affinity matrix. More specifically, after extracting the keypoint features of images, the self-attention layer proposed in GCAN is used to learn node embeddings that encode graph structural and positional information. Then, the cross-attention layer proposed in GCAN \citep{jiang2022graph} is used as the affinity layer to obtain cross-graph node-to-node affinities. 


\subsection{PPI Network Matching}

\paragraph{Datasets} We further considered a dataset consisting of a protein-protein interaction (PPI) network of yeast and its noisy versions. This PPI network consists of 1004 proteins and 4920 high-confidence interactions among proteins~\citep{liu2021stochastic}. We match the PPI network with its three noisy versions, which contain 5\%, 15\%, and 25\% low-confidence interactions, respectively, in addition to the high-confidence interactions~\citep{saraph2014magna}. This dataset has been used in previous studies \citep{xu2019gromov,liu2021stochastic} to evaluate graph matching approaches.

\paragraph{Baselines} We considered the following baselines: PISwap \citep{chindelevitch2013optimizing},  GHOST \citep{patro2012global}, MI-GRAAL \citep{kuchaiev2011integrative}, MAGNA++ \citep{vijayan2015magna++}, HubAlign \citep{hashemifar2014hubalign}, NETAL \citep{neyshabur2013netal}, CPD+Emb \citep{grover2016node2vec, myronenko2010point}, GWL+Emb \citep{xu2019gromov}, GWL \citep{xu2019scalable}, S-GWL \citep{xu2019scalable}, and SIGMA \citep{liu2021stochastic}.
  
\paragraph{Experimental setup} To conduct a fair comparison, we follow the setup proposed in SIGMA \citep{liu2021stochastic} and S-GWL \citep{xu2019scalable}. Following the setup of SIGMA, the input feature of each node is assigned based on its node degree. 
We use node correctness as the evaluation metric, calculated as the percentage of nodes that have the same matching as the ground truth~\citep{liu2021stochastic,xu2019scalable}. 


\paragraph{StableGM model} To evaluate our approach on the PPI network, the StableGM model was configured as follows. A 5-layer Graph Isomorphism Network (GIN) was used as $\textsc{nn}_{\theta}$ to learn node embeddings. It should be noted that the same graph neural network architecture was used in SIGMA \citep{liu2021stochastic} to learn node embeddings. In the affinity layer of the StableGM model, we employ the same technique used in SIGMA for a fair comparison, which calculates the affinity between two nodes as the cosine similarity between the embeddings of the two nodes.
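
In symbols, the cosine-similarity affinity can be sketched as follows, where $\mathbf{h}_i$ and $\mathbf{h}'_j$ are hypothetical names (not used elsewhere in this paper) for the embeddings of node $i\in V_S$ and node $j\in V_T$ produced by $\textsc{nn}_{\theta}$:
\begin{equation*}
\mathbf{C}_{i,j} = \frac{\mathbf{h}_i^{\top}\mathbf{h}'_j}{\lVert\mathbf{h}_i\rVert_2\,\lVert\mathbf{h}'_j\rVert_2},
\end{equation*}
so that $\mathbf{C}_{i,j}\in[-1,1]$ whenever both embeddings are nonzero.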
 



 



\subsection{Results and Discussion} 

As shown in Table~\ref{tab:pascalVOC}, StableGM-v2 outperforms all other methods in seven out of twenty classes of the Pascal VOC Keypoint dataset. Specifically, while our approach outperforms the baselines for some classes (e.g., \textit{aero}, \textit{sheep}, \textit{bike}), it underperforms them for some other classes. On the SPair-71k dataset (Table~\ref{tab:spair71K}), both StableGM-v1 and StableGM-v2 outperform all baselines in terms of mean accuracy over all classes. It should be noted that, compared with the Pascal VOC Keypoint dataset, the SPair-71k dataset has several advantages, such as higher image quality, more reliable keypoint annotations, and the removal of the \textit{sofa} and \textit{table} classes, which are ambiguous and poorly annotated \citep{rolinek2020deep}. Indeed, Table~\ref{tab:pascalVOC} shows that the mean accuracy of StableGM-v2 is lower than that of GCAN primarily because StableGM-v2 performs poorly on the \textit{sofa} and \textit{table} classes. 



For the Willow ObjectClass dataset, Table~\ref{tab:Willow} shows that the mean accuracies of the StableGM models and GCAN are all high, close to 100\%, and that their performance is generally comparable. The mean accuracy of GCAN is slightly higher than that of our models. However, StableGM-v2 outperforms GCAN in 2 classes and performs equally well in 1 class, while StableGM-v1 outperforms GCAN in 2 classes and performs equally well in 2 classes.  


\begin{table}[th!]
    \centering
    \resizebox{0.33\textwidth}{!}{
    \begin{tabular}{l| c c c} 
 \toprule
 Method & Yeast 5\% & Yeast 15\% & Yeast 25\%  \\ 
 \midrule
 PISwap  & 0.1 & 0.1 & 0.0 \\
 GHOST  & 11.1   & 0.4  & 0.3  \\
 MI-GRAAL  & 18.0  & 6.9 & 5.2  \\
 MAGNA++ & 48.1 & 25.0 & 13.6 \\
 HubAlign & 50.0 & 35.2 & 12.9 \\
 NETAL & 6.9 & 0.9 & 1.0 \\
 CPD+Emb  & 3.6 & 2.1 & 2.0 \\
 GWL+Emb & 83.7 & \underline{66.6} & \underline{58.0} \\
 GWL & 82.4  & 65.34 & \textbf{58.8} \\
 S-GWL & 81.1 & 61.85 & 56.27 \\
 SIGMA & \underline{84.7$\pm${0.4}} & 57.4$\pm${1.1} & 41.4 $\pm${1.7} \\
\midrule
StableGM & \textbf{86.1} $\pm$ \textbf{0.9} & \textbf{67.9} $\pm$ \textbf{1.1} & 57 $\pm$ 0.6\\ 
\bottomrule
\end{tabular}
}
    \caption{Node correctness (\%) results on the PPI dataset. The best results are in bold. The second best results are underlined.}
    \label{tab:PPI}
\end{table}



Table \ref{tab:PPI} shows that our StableGM model achieves superior performance compared to other baselines when matching the PPI network with its 5\% and 15\% noisy versions. However, when the noise level increases to 25\%, the model's performance becomes inferior to GWL+Emb \citep{xu2019gromov} and GWL \citep{xu2019scalable}. These empirical findings suggest that our method performs well under low noise settings but struggles with higher levels of noise. Similarly, when it comes to the image datasets, our method shows superior or comparable performance for most classes, but its performance on poorly annotated classes of Pascal VOC Keypoint (\emph{sofa} and \emph{table}) is considerably worse than some of the baselines. Thus, improving the proposed method's performance in settings with high noise or poor/ambiguous annotations could be an area for future research. 

We have further carried out experiments to demonstrate the effect of selecting the hardness attention parameter $\beta$ and the results are included in the supplementary material.



%%%%%%%%% Background

\section{Related Work}
\label{sec:background}

\paragraph{Deep Graph Matching}
In the literature on deep graph matching, the problem of graph matching has been primarily formulated in several ways: 1) Quadratic Assignment Problem \citep{wang2019learning,yu2019learning,rolinek2020deep,wang2021neural}, 2) Stochastic Optimization Problem \citep{liu2021stochastic}, 3) Optimal Transport Problem \citep{yu2019learning}, and 4) Integer Linear Programming Problem \citep{jiang2022graph}. The study in \citep{zanfir2018deep} is considered the first attempt to use deep learning to solve graph matching. Graph Neural Networks (GNNs) were first employed for graph matching in the seminal work \citep{wang2019learning}. The use of GNNs enables the formulation of graph matching as a linear assignment problem, which is then solved using the Hungarian algorithm. BBGM \citep{rolinek2020deep} considered graph matching as a Quadratic Assignment Problem and adapted the technique of black-box differentiation of combinatorial solvers \citep{vlastelica2019differentiation} to obtain the final node correspondence. The representation learning technique proposed in \citep{rolinek2020deep} for graph matching in keypoint-annotated image datasets has been used in \citep{wang2021neural} to achieve state-of-the-art results while considering graph matching as a constrained vertex classification task.      


\paragraph{Contrastive Learning}

A contrastive loss function was introduced in \citep{chopra2005learning}, in which the objective is to minimize the Euclidean distance between the feature vectors of positive samples while pushing the distance between negative samples beyond a predefined margin. Subsequently, the concept of contrastive learning has been extended to different applications, and various contrastive loss functions such as the Triplet loss \citep{schroff2015facenet}, the N-pair loss \citep{sohn2016improved}, and the InfoNCE loss \citep{oord2018representation} have been introduced.

Recently, contrastive learning has attracted significant attention due to its success in self-supervised learning \citep{wu2018unsupervised,he2020momentum, wang2021understanding}.
It has been shown that the performance of contrastive losses is enhanced by explicitly sampling hard negatives \citep{xu2022negative}.
In \citep{khosla2020supervised}, the standard contrastive loss function has been adapted to the supervised classification setting by incorporating the information related to class labels. The main difference between this setting and self-supervised contrastive learning is the availability of many positive samples per anchor (in addition to the many negative samples), because data instances from the same class are considered positives. 

%%%%%%%%% Conclusion

\section{Conclusion}\label{sec:conclusion}

In this work, we introduced a new contrastive learning framework for graph matching, named StableGM. Within this framework, we proposed a novel hard negative sampling strategy and a new contrastive loss function that suits the graph matching setting. We provided a theoretical analysis of the proposed contrastive matching loss and showed that it possesses properties that overcome the limitations that can occur when adapting a standard contrastive loss function to our setting. Additionally, we described how our StableGM framework provides theoretical guarantees for optimal matching. We conducted experiments on benchmark datasets that are widely used for deep graph matching, and our empirical evaluation demonstrates the effectiveness of our approach. This work opens a direction toward treating supervised graph matching as a contrastive learning problem and thereby adapting concepts from the contrastive learning literature to the graph matching domain. 

Our code is available on GitHub: \url{https://github.com/Gathika94/StableGM.git}.


\begin{acknowledgements} % will be removed in pdf for initial submission,
This work was partly supported by the Australian Research Council under Discovery Project DP210102273. We would also like to thank the anonymous reviewers for their comments, which helped improve the quality of the paper.
\end{acknowledgements}

% References

\bibliography{ratnayaka_648}
\end{document}