Transformer-CTP: Current Token Prediction Using Cross-Attention of Queries with Current Position Information

ACL ARR 2024 June Submission1383 Authors

14 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Transformers and pre-trained language models have advanced a wide range of tasks in artificial intelligence. Transformer decoders for language generation are typically trained with the LM loss, which predicts the next token from the outputs of the previous tokens so that the decoder can generate text autoregressively. In addition, transformers compute their outputs with self-attention, which generally assigns a high attention weight to the self-token. As a result, a transformer decoder may over-focus on token $t-1$ when predicting token $t$, because it predicts token $t$ through self-attention over tokens up to $t-1$. The proposed method prevents the decoder from over-focusing on token $t-1$ when predicting token $t$: instead of predicting token $t$ from the output of token $t-1$, we predict it from a new input. We also add a CTP module to the transformer decoder, which keeps token $t-1$ out of the attention query by performing cross-attention with this new input. We evaluate the proposed method on machine translation and document summarization to verify that it mitigates the over-focusing problem and improves performance. In our experiments, the proposed method improved performance and redistributed the attention that had been concentrated on a few specific tokens, including the self-token. The code for our experiments is available on our GitHub.
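To make the described mechanism concrete, the sketch below illustrates one possible reading of it in PyTorch: the attention query for predicting token $t$ is built from an embedding of the current position (the "new input"), while the keys and values come from the decoder hidden states, so the query itself never carries token $t-1$. This is our own illustration, not the authors' released code; the class name, the use of a learned position embedding as the current-position information, and the mask convention are assumptions.

```python
import torch
import torch.nn as nn


class CurrentPositionCrossAttention(nn.Module):
    """Hypothetical sketch of a CTP-style module: queries come from the current
    position embedding, keys/values come from the decoder hidden states."""

    def __init__(self, d_model: int, n_heads: int, max_len: int = 1024):
        super().__init__()
        # Embedding of the position being predicted, used as the attention query.
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, seq_len, d_model) — decoder outputs for the prefix tokens.
        # causal_mask:    (seq_len, seq_len) bool mask, True where attention is blocked,
        #                 so the query at index i only sees states of tokens 0..i.
        batch, seq_len, _ = decoder_states.shape
        positions = torch.arange(seq_len, device=decoder_states.device)
        queries = self.pos_emb(positions).unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(queries, decoder_states, decoder_states, attn_mask=causal_mask)
        return out  # fed to the LM head to predict the next token at each position


# Usage sketch: a standard causal mask, so token t-1's state is still visible
# through keys/values, but the query itself carries only position information.
seq_len, d_model, n_heads = 8, 512, 8
module = CurrentPositionCrossAttention(d_model, n_heads)
states = torch.randn(2, seq_len, d_model)
mask = ~torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
predicted_features = module(states, mask)  # (2, seq_len, d_model)
```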
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: fine-tuning, abstractive summarisation, machine translation
Contribution Types: Model analysis & interpretability, Reproduction study
Languages Studied: English, French, German, Romanian
Submission Number: 1383