Remixers: A Mixer-Transformer Architecture with Compositional Operators for Natural Language Understanding

Anonymous

17 Sept 2021 (modified: 05 May 2023) · ACL ARR 2021 September Blind Submission · Readers: Everyone
Abstract: Recent work such as MLP-Mixer (Tolstikhin et al.) has demonstrated the promise of All-MLP architectures. While All-MLP architectures achieve reasonable performance in computer vision and have garnered recent interest, we argue that making them effective in NLP applications remains an uphill battle. Hence, there may be no solid reason to drop the self-attention modules altogether. In this paper, we propose a new Mixer-Transformer architecture, showing that Transformers and Mixer models can be quite complementary. Fundamentally, we show that Mixer models are capable of acting as persistent global memory (in a similar vein to standard MLPs) while being imbued with global receptive fields at the same time. Hence, interleaving sample-dependent, input-local self-attention with persistent Mixer modules can be an effective strategy. Additionally, we propose compositional remixing, a new way of baking compositional operators (multiplicative and subtractive composition) into the mixing process to improve the expressiveness of the model. This allows us to effectively model relationships between unmixed and mixed representations, an inductive bias that we postulate is powerful for NLU applications. Via extensive experiments on 14 challenging NLU datasets (e.g., SuperGLUE, entailment, and compositional generalization), we show that the proposed architecture consistently outperforms a strong T5 baseline (Raffel et al.). We believe this work paves the way for more effective synergies between the two families of models.
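To make the two ideas named in the abstract concrete, the following is a minimal sketch, assuming one plausible layout: a block that interleaves sample-dependent self-attention with a persistent Mixer-style token-mixing MLP, and a "compositional remixing" step that relates unmixed and mixed representations through multiplicative and subtractive composition. All module names, dimensions, and the exact composition (concatenating the mixed features with their product and difference against the unmixed input) are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only; the real Remixer architecture may differ.
import torch
import torch.nn as nn


class CompositionalRemix(nn.Module):
    """Hypothetical compositional remixing: combine unmixed and mixed
    representations via multiplicative and subtractive operators."""

    def __init__(self, d_model: int):
        super().__init__()
        # Projects the concatenated [mixed, x * mixed, x - mixed] features.
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, x: torch.Tensor, mixed: torch.Tensor) -> torch.Tensor:
        composed = torch.cat([mixed, x * mixed, x - mixed], dim=-1)
        return self.proj(composed)


class RemixerBlock(nn.Module):
    """One block: sample-dependent self-attention followed by a persistent
    token-mixing MLP whose output is compositionally remixed with its input."""

    def __init__(self, d_model: int, n_heads: int, seq_len: int):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mix_norm = nn.LayerNorm(d_model)
        # Token-mixing MLP acts across the sequence dimension; its weights are
        # input-independent, serving as persistent global memory with a
        # global receptive field (fixed seq_len, as in MLP-Mixer).
        self.token_mix = nn.Sequential(
            nn.Linear(seq_len, seq_len), nn.GELU(), nn.Linear(seq_len, seq_len)
        )
        self.remix = CompositionalRemix(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, D)
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.mix_norm(x)
        mixed = self.token_mix(h.transpose(1, 2)).transpose(1, 2)
        return x + self.remix(h, mixed)


# Toy forward pass over random embeddings.
block = RemixerBlock(d_model=64, n_heads=4, seq_len=16)
out = block(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```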