Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling

Hongyu Gong; Yun Tang; Juan Pino; Xian Li

Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling

Hongyu Gong, Yun Tang, Juan Pino, Xian Li

Published: 09 Nov 2021, Last Modified: 05 May 2023NeurIPS 2021 PosterReaders: Everyone

Keywords: Multi-head attention, parameter sharing, speech, translation

Abstract: Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative interference across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains. Evaluated in various tasks including speech recognition, text-to-text and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average of $+2.0$ BLEU over $13$ language directions in multilingual setting and $+2.0$ BLEU over $3$ domains in multi-domain setting.

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

TL;DR: We propose attention head selection to mitigate interference within multi-head attention in multilingual and multi-domain modeling.

Supplementary Material: pdf

14 Replies

Loading