Abstract: Transformers suffer from the predictive uncertainty problem. We show that pre-trained Transformers can be further regularized with mutual information to alleviate this issue in neural machine translation (NMT). To enhance the learned representations, we explicitly capture the nonlinear mutual dependencies in the two types of attention in the decoder, thereby reducing model uncertainty. Specifically, we use mutual information to measure the nonlinear mutual dependencies of token-token interactions during attention computation, and we resort to InfoNCE for mutual information estimation to avoid its intractability. By maximizing the mutual information among tokens, the model captures more knowledge about token-token interactions from the training corpus, which reduces its uncertainty. Experimental results on WMT'14 En$\rightarrow$De and WMT'14 En$\rightarrow$Fr demonstrate consistent and evident improvements over strong baselines, and quantifying model uncertainty further supports our hypothesis. The proposed plug-and-play approach can be easily incorporated into pre-trained Transformer models. Code will be released soon.
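For intuition, the sketch below shows a generic InfoNCE lower-bound estimator of the kind the abstract refers to. Pairing token representations from the decoder's two attention streams, the function name, and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def infonce_lower_bound(z_a, z_b, temperature=0.1):
    """Generic InfoNCE estimate of mutual information between two batches
    of token representations, e.g. outputs associated with the decoder's
    self-attention and cross-attention (an assumed pairing for illustration).
    Rows with the same index form positive pairs; all other rows in the
    batch act as negatives."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature               # (batch, batch) similarity scores
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy over the similarity matrix gives the negative InfoNCE bound;
    # maximizing mutual information corresponds to minimizing this cross-entropy.
    return -F.cross_entropy(logits, labels)
```

In a regularization setting of the kind described, the negative of this bound would typically be added as an auxiliary term to the standard NMT training loss.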
Paper Type: long