Recent works have aimed to calibrate transformers using Bayesian approaches. \citet{fan2020bayesian} and~\citet{cinquin2021pathologies} apply variational inference to the attention matrices. \citet{liu2020simple} and~\citet{bradshaw2017adversarial} suggest fitting a GP on the output of the last attention layer. \citet{chen2023calibrating} also use GPs, fitting a sparse variational GP to each attention layer and propagating uncertainty across the layers. CGPT extends this research direction by fitting correlated GPs to the attention outputs.
Additionally, convolutional and recurrent neural networks have benefited from the application of Bayesian approaches~\citep{mukhoti2018evaluating, kendall2017uncertainties, gustafsson2020evaluating, chien2015bayesian, ritter2021sparse, tran2019bayesian}, and early efforts to employ similar methods for transformers have attained initial successes~\citep{xue2021bayesian}. In another line of work, \citet{muller2021transformers} connect transformers to Bayesian inference, showing that transformers can efficiently perform Bayesian inference. Our proposed CGPT is complementary to those methods.