% Summarize Transformer => Background TF, GP => problem of symmetric TF => question => method => contribution

\par Transformers \citep{vaswani2017attention} have recently emerged as the preferred models in various sequence modeling tasks, including those in computer vision~\citep{al2019character,dosovitskiy2020image,ramesh2021zero,radford2021learning,9710415,liu2021video,zhao2021point,guo2021pct}, natural language processing~\citep{baevski2018adaptive,dehghani2018universal,devlin2018bert,al2019character,dai2019transformer,NEURIPS2020_1457c0d6,brown2020language}, and reinforcement learning~\citep{chen2021decision,janner2021offline}, due to their computational advantage in replacing the expensive recurrence operations of recurrent neural networks~\citep{medsker2001recurrent} and long short-term memory (LSTM) networks~\citep{hochreiter1997long} with a feed-forward attention mechanism that allows for significantly more parallelization in model training~\citep{lin2022survey,tay2022efficient}. Transformer-based pre-trained models can also be effectively adapted to new tasks with limited supervision~\citep{radford2018improving,radford2019language,devlin2018bert,yang2019xlnet,liu2019roberta}. In particular, the core component of a transformer model is the multi-head self-attention (MHSA) mechanism, which captures sequential dependencies among the tokens of a single sequence by representing each token as a weighted average of (learnable) functions of other tokens, where the (learnable) weights characterize the similarities between tokens~\citep{cho-etal-2014-learning,parikh-etal-2016-decomposable,DBLP:journals/corr/LinFSYXZB17}. Intuitively, such weights represent the amount of attention each token needs to give to the others to obtain its contextual representation~\citep{bahdanau2014neural,vaswani2017attention,kim2017structured}.
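
\par As a minimal illustration of this weighted-average view, consider a single head of standard scaled dot-product self-attention \citep{vaswani2017attention}. The symbols below are assumptions made for this sketch only and are not the notation fixed later in the paper: $\mathbf{x}_1, \ldots, \mathbf{x}_n$ denote the input tokens, $\mathbf{W}_q$, $\mathbf{W}_k$ and $\mathbf{W}_v$ denote learnable projection matrices, and $d$ denotes the dimension of the projected queries and keys. The head then computes
\[
\mathbf{h}_i = \sum_{j=1}^{n} a_{ij} \, \mathbf{W}_v \mathbf{x}_j,
\qquad
a_{ij} = \frac{\exp\left( (\mathbf{W}_q \mathbf{x}_i)^\top (\mathbf{W}_k \mathbf{x}_j) / \sqrt{d} \right)}{\sum_{l=1}^{n} \exp\left( (\mathbf{W}_q \mathbf{x}_i)^\top (\mathbf{W}_k \mathbf{x}_l) / \sqrt{d} \right)},
\]
so that each output $\mathbf{h}_i$ is a convex combination of the value vectors $\mathbf{W}_v \mathbf{x}_j$, with weights $a_{ij}$ that are non-negative and sum to one over $j$.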

\par Despite its growing successes, the original transformer lacks a mechanism for uncertainty calibration, which is essential to provide trustworthy predictions and enable robust, risk-averse decision making in safety-critical tasks~\citep{chen2023calibrating}. This limitation has motivated a recent line of work \citep{xue2021bayesian, tran2019bayesian} that develops uncertainty quantification techniques for transformers. In particular, the most recent work in this direction \citep{chen2023calibrating} has drawn a connection between the self-attention mechanism and the inference mechanism of a Gaussian process (GP) \citep{Rasmussen06}, which interestingly appears to share the same principle of building a contextual representation for an input based on its similarities to other inputs.

This is perhaps not surprising in hindsight, considering that GPs have historically been adopted to model various forms of spatio-temporal data \citep{luttinen2012efficient, hamelijnck2021spatio} and their inherent temporal dependencies based on a kernelized measure of input similarities. This also translates directly to the attention mechanism through the lens of kernel attention \citep{tsai2019transformer, chen2024primal}, albeit with no uncertainty quantification. Despite this interesting progress in connecting the modern literature on transformers to the classical research on GPs for uncertainty quantification, prior work~\citep{chen2023calibrating} in this direction has to compromise the representational capacity of transformers in order to make such a connection. In particular, the linear transformation functions for the query and key vectors of the self-attention in transformers have to be tied to the same parameterization in order to cast the attention output as the output of a GP with a valid symmetric kernel. This constraint reduces the model's performance significantly, as shown in our experiments, which is consistent with the results for original transformers reported in~\citep{tsai2019transformer}.
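
\par To make the symmetry requirement concrete, the following sketch compares the standard GP regression posterior mean \citep{Rasmussen06} with a kernel attention head. The symbols $\mathbf{x}_j$, $\mathbf{y}$, $\sigma^2$, $\alpha_j$, $\mathbf{q}_i$, $\mathbf{k}_j$ and $\mathbf{v}_j$ are assumptions made for this illustration only. Given training inputs $\mathbf{x}_1, \ldots, \mathbf{x}_n$ with targets $\mathbf{y}$, noise variance $\sigma^2$ and kernel matrix $\mathcal{K}$ with entries $\kappa(\mathbf{x}_i, \mathbf{x}_j)$, the GP posterior mean at a test input $\mathbf{x}_*$ is
\[
m(\mathbf{x}_*) = \sum_{j=1}^{n} \kappa(\mathbf{x}_*, \mathbf{x}_j) \, \alpha_j,
\qquad
\alpha_j = \left[ \left( \mathcal{K} + \sigma^2 \mathbf{I} \right)^{-1} \mathbf{y} \right]_j,
\]
where $\mathcal{K}$ must be symmetric and positive semi-definite for $\kappa$ to be a valid kernel. A kernel attention head, written here without normalization for simplicity, has the analogous form
\[
\mathbf{h}_i = \sum_{j=1}^{n} \kappa(\mathbf{q}_i, \mathbf{k}_j) \, \mathbf{v}_j,
\]
with queries $\mathbf{q}_i$, keys $\mathbf{k}_j$ and values $\mathbf{v}_j$. Its weight matrix $[\kappa(\mathbf{q}_i, \mathbf{k}_j)]_{ij}$ is symmetric only if $\kappa(\mathbf{q}_i, \mathbf{k}_j) = \kappa(\mathbf{q}_j, \mathbf{k}_i)$ for all $i, j$. This holds, for instance, when $\mathbf{q}_i = \mathbf{k}_i$ for every token, but not in general when queries and keys are produced by different projections.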

%\footnote{The work of \citep{chen2023calibrating} has also shown that the output of kernel attention can be cast as the GP prediction with appropriate parameterization of transformation for queries, keys and values.}

{\bf Our Contribution.} To remove this restriction, we introduce a new perspective on GP-based transformers which preserves the original modeling flexibility of self-attention, allowing the attention matrix to be asymmetric as needed. This is achieved by characterizing the attention output not as a GP prediction but as a prediction based on the cross-covariance between two correlated GPs (CGPs), which allows kernel asymmetries while retaining the quantifiable uncertainty structure of GPs. To substantiate this high-level idea, we make the following technical contributions:

{\bf 1.} In Section \ref{sec:method}, we derive a correspondence between the self-attention units of the multi-head self-attention (MHSA) mechanism and the cross-covariance between two CGPs modeled as different affine transformations of a common latent GP, where one GP places a distribution over the function of queries while the other places a distribution over the function of keys.
% The correlation structure is inspired by the classical work of~\citep{aueb2013variational}.  A new framework of Correlated Gaussian Process Transformer (CGPT) can then be created by replacing the original attention unit with the new CGP-based attention block.

%Both GPs and their correlating structure are trained based on the data derived from the output of the previous attention block. We then propose the novel Correlated Gaussian Process Transformer (CGPT), the transformer that utilizes the correlation structure between two CGPs.
 
{\bf 2.} In Section \ref{sec: SCGPT}, we derive a sparse approximation to the above CGP-based attention unit, which removes the cubic dependence of the processing cost of our Correlated Gaussian Process Transformer (CGPT) on the number of input tokens. We further develop a loss regularizer in terms of the log marginal likelihood of the CGPs (or sparse CGPs), which augments the conventional training loss of the transformer to balance between minimizing the uncertainty of the attention output (i.e., the CGP's prediction) and minimizing its induced classification loss.

{\bf 3.} In Section \ref{sec:experiments}, we empirically demonstrate that both our CGPT and its sparse approximation achieve better predictive performance than state-of-the-art kernel-attention and GP-based transformers across multiple computer vision and natural language processing benchmarks. 
% In addition, CGPT also outperforms existing transformer baselines significantly in terms of uncertainty calibration in both in-distribution and out-of-distribution settings across all benchmark datasets. For better scalability, CGPT can be replaced by its sparse approximation in exchange for a (expected) slight decrease in uncertainty calibration. 
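
\par The asymmetry permitted by a cross-covariance, which underlies the first contribution above, can be illustrated with a finite-dimensional sketch. The symbols $\mathbf{u}$, $\mathbf{A}_q$, $\mathbf{A}_k$, $\mathbf{b}_q$, $\mathbf{b}_k$, $\mathbf{z}_q$ and $\mathbf{z}_k$ are assumptions made for this illustration only, and the construction in Section~\ref{sec:method} may differ in its details. Let $\mathbf{u}$ be a zero-mean Gaussian vector with covariance matrix $\mathcal{K}$, and define two affine transformations $\mathbf{z}_q = \mathbf{A}_q \mathbf{u} + \mathbf{b}_q$ and $\mathbf{z}_k = \mathbf{A}_k \mathbf{u} + \mathbf{b}_k$ with deterministic $\mathbf{A}_q$, $\mathbf{A}_k$, $\mathbf{b}_q$ and $\mathbf{b}_k$. Then
\[
\mathrm{Cov}(\mathbf{z}_q, \mathbf{z}_k) = \mathbf{A}_q \mathcal{K} \mathbf{A}_k^\top,
\qquad
\mathrm{Cov}(\mathbf{z}_k, \mathbf{z}_q) = \mathbf{A}_k \mathcal{K} \mathbf{A}_q^\top = \left( \mathbf{A}_q \mathcal{K} \mathbf{A}_k^\top \right)^\top .
\]
The cross-covariance matrix $\mathbf{A}_q \mathcal{K} \mathbf{A}_k^\top$ is therefore not symmetric in general when $\mathbf{A}_q \neq \mathbf{A}_k$, whereas each marginal covariance, e.g., $\mathbf{A}_q \mathcal{K} \mathbf{A}_q^\top$, remains symmetric and positive semi-definite, so both $\mathbf{z}_q$ and $\mathbf{z}_k$ are still valid Gaussian vectors.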

% For better clarity, we include a concise review of MHSA used in Transformers and its GP modeling equivalence in Section 2 below.

% \par \textbf{Organization.} We organize the paper as follows. Section~\ref{sec:background} provides background on the self-attention mechanism in transformers and preliminaries on GPs. Section~\ref{sec: attention as GP} highlights the connection between self-attention and GP inference, which imposes a symmetric structure on the structure of kernel attention. To mitigate this, Section~\ref{sec:method} introduces a new correlated GP (CGP) framework for uncertainty quantification whose inference structure can be leveraged to accommodate asymmetries in kernel attention. Its prediction can also be shown to correspond to the output of kernel attention. Section~\ref{sec: SCGPT} derives its sparse approximation for better scalability. Section~\ref{sec:experiments} presents extensive empirical results demonstrating the performance advantages of our CGPT method over existing transformer baselines. For interested readers, we also provide succinct review of related works in Section \ref{sec:related_work}. For the interested readers, additional derivations and experiment results are provided in the appendix.

\par \textbf{Notation.} For the rest of this paper, we use $\kappa(\cdot, \cdot)$ to denote a kernel function. Kernel matrices are denoted by $\mathcal{K}$, and all other matrices are denoted by bold uppercase letters. We denote a Gaussian distribution with mean $\mu$ and variance $\sigma^2$ by $\mathbb{N}(\mu, \sigma^2)$, and we use subscripts to distinguish the contexts in which kernel functions and matrices are computed.


%which include most notably, the pioneering work of \cite{tsai2019transformer} that establishes 

% Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel

%\par Due to the widespread use of Transformers, it is of importance to calibrate its modelling uncertainty in order to design robust and reliable transformer models. To this end, previous works have introduced Gaussian Processes (GPs) \cite{} to deep learning model construction \cite{} and extended GPs methods to Transformers \cite{}. Despite initial successes, there are restrictions and drawbacks in the design of these models that hinder the representation capacity of the models.

