Disentangled Mask Attention in Transformer

29 Sept 2021 (modified: 13 Feb 2023) · ICLR 2022 Conference Withdrawn Submission
Readers: Everyone
Keywords: sequential learning, mask attention, latent disentanglement, transformer
Abstract: The transformer performs self-attention and has achieved state-of-the-art performance in many applications. Multi-head attention in the transformer gathers features from the individual tokens of the input sequence to form the mapping to the output sequence. Learning representations with the transformer has two weaknesses. First, because the attention mechanism naturally mixes the features of different tokens in the input and output sequences, the representations of the input tokens are likely to contain redundant information. Second, the attention-weight patterns of different heads tend to be similar, so the representation capacity of the model may be limited. To strengthen sequential learning representations, this paper presents a new disentangled mask attention for the transformer in which redundant features are reduced and semantic information is enriched. Latent disentanglement in multi-head attention is learned. The attention weights are filtered by a mask that is optimized through semantic clustering. The proposed attention mechanism is trained for sequential learning according to the clustered disentanglement objective. Experiments on machine translation show the merit of this disentangled transformer in sequence-to-sequence learning tasks.
One-sentence Summary: A new disentangled mask attention is proposed for sequence-to-sequence learning with the transformer.
Supplementary Material: zip
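
To make the mechanism described in the abstract more concrete, the snippet below is a minimal PyTorch sketch of one way that multi-head attention weights could be filtered by a mask derived from semantic clustering. The class name `ClusteredMaskAttention`, the learnable centroids, the soft per-head cluster assignment, and the renormalization step are illustrative assumptions, not the paper's actual formulation of the mask or of the clustered disentanglement objective.

```python
# Illustrative sketch only: multi-head attention whose weights are filtered by
# a mask derived from clustering token representations. All design choices
# (soft assignment of keys to learnable centroids, same-cluster masking,
# renormalization) are assumptions, not the authors' method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClusteredMaskAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_clusters: int = 4):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learnable centroids used to assign tokens to semantic groups.
        self.centroids = nn.Parameter(torch.randn(num_clusters, self.d_head))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):
            # Reshape to (batch, heads, seq_len, d_head).
            return z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)

        # Standard scaled dot-product attention logits: (b, h, t, t).
        logits = q @ k.transpose(-2, -1) / self.d_head ** 0.5

        # Soft cluster assignment of each key token to a centroid, standing in
        # for the "semantic clustering" that optimizes the mask: (b, h, t, c).
        assign = F.softmax(k @ self.centroids.t(), dim=-1)
        # Token pairs in the same cluster keep high mask values: (b, h, t, t).
        mask = assign @ assign.transpose(-2, -1)

        # Filter the attention weights with the cluster-derived mask and
        # renormalize so each row still sums to one.
        weights = F.softmax(logits, dim=-1) * mask
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)

        ctx = weights @ v                                   # (b, h, t, d_head)
        ctx = ctx.transpose(1, 2).reshape(b, t, self.num_heads * self.d_head)
        return self.out(ctx)


if __name__ == "__main__":
    attn = ClusteredMaskAttention(d_model=64, num_heads=4, num_clusters=3)
    x = torch.randn(2, 10, 64)
    print(attn(x).shape)  # torch.Size([2, 10, 64])
```

In this sketch the mask damps attention between tokens assigned to different semantic clusters, which is one plausible reading of "attention weights filtered by a mask optimized by semantic clustering"; the paper's supplementary material should be consulted for the actual objective and training procedure.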