This paper introduces the Correlated Gaussian Process Transformer (CGPT), a new framework to calibrate uncertainty for transformers.
CGPT leverages a novel CGP representation, which allows us to draw a connection between the output of the kernel attention often used in transformers and inference using the cross-covariance between two correlated GPs defined through a latent canonical GP. With this formulation, our cross-covariance function does not have to be a symmetric kernel, a condition that existing GP-based transformers impose in exchange for uncertainty calibration.
Our framework therefore preserves the representational flexibility of attention by allowing asymmetries in the attention unit, while still fully exploiting the uncertainty estimation ability of GPs.
% Our experiments show that CGPT achieves better performance in both accuracy and calibration metrics than the existing baselines.
Improving the efficiency of CGPT using random features or sparse GPs is an interesting direction for future work.