Bidirectional Attention as a Mixture of Continuous Word Experts

Kevin Christian Wibisono; Yixin Wang

Published: 08 May 2023 | Last Modified: 22 Jun 2025 | UAI 2023 | Readers: Everyone

Keywords: self-attention, bidirectional attention, mixture of experts, large language models, position encodings, masked language model, exponential family embeddings

TL;DR: This paper shows that fitting bidirectional attention is equivalent to fitting a continuous bag of words model with mixture-of-experts weights.

Abstract: Bidirectional attention—composed of the neural network architecture of self-attention with positional encodings, together with the masked language model (MLM) objective—has emerged as a key component of modern large language models (LLMs). Despite its empirical success, few studies have examined its statistical underpinnings: What statistical model is bidirectional attention implicitly fitting? What sets it apart from its non-attention predecessors? We explore these questions in this paper. The key observation is that fitting a single-layer single-head bidirectional attention, upon reparameterization, is equivalent to fitting a continuous bag of words (CBOW) model with mixture-of-experts (MoE) weights. Further, bidirectional attention with multiple heads and multiple layers is equivalent to stacked MoEs and a mixture of MoEs, respectively. This statistical viewpoint reveals the distinct use of MoE in bidirectional attention, which aligns with its practical effectiveness in handling heterogeneous data. It also suggests an immediate extension to categorical tabular data, if we view each word location in a sentence as a tabular feature. Across empirical studies, we find that this extension outperforms existing tabular extensions of transformers in out-of-distribution (OOD) generalization. Finally, this statistical perspective of bidirectional attention enables us to theoretically characterize when linear word analogies are present in its word embeddings. These analyses show that bidirectional attention can require much stronger assumptions to exhibit linear word analogies than its non-attention predecessors.
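
To make the central claim concrete, the following is a minimal NumPy sketch of the contrast the abstract describes: a plain CBOW predictor that averages context embeddings with equal weights, next to a variant whose weights come from an attention-style softmax gate, so that each context word acts as an "expert" for the masked position. This is an illustration under stated assumptions, not the paper's exact reparameterization: the function names (`cbow_logits`, `moe_cbow_logits`), the parameters (`emb`, `out`, `query`, `key_proj`), and the random toy data are all hypothetical, and positional encodings are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cbow_logits(context_ids, emb, out):
    """Plain CBOW: every context word receives the same weight."""
    n = len(context_ids)
    weights = np.full(n, 1.0 / n)
    # Weighted average of context embeddings, scored against each vocabulary word.
    logits = weights @ emb[context_ids] @ out.T
    return logits, weights


def moe_cbow_logits(context_ids, emb, out, query, key_proj):
    """CBOW with mixture-of-experts weights (hypothetical gate).

    The gate is an attention-style softmax over context positions, so the
    weight of each context word depends on the context itself.
    """
    keys = emb[context_ids] @ key_proj   # shape (n, d)
    weights = softmax(keys @ query)      # shape (n,), sums to 1
    logits = weights @ emb[context_ids] @ out.T
    return logits, weights


# Toy setup: vocabulary of 10 words, 4-dimensional embeddings (assumed values).
rng = np.random.default_rng(0)
V, d = 10, 4
emb = rng.normal(size=(V, d))        # context ("input") embeddings
out = rng.normal(size=(V, d))        # center ("output") embeddings
context_ids = np.array([1, 3, 5, 7])
query = rng.normal(size=d)           # stands in for the masked position's query
key_proj = rng.normal(size=(d, d))

uniform_logits, uniform_w = cbow_logits(context_ids, emb, out)
gated_logits, gated_w = moe_cbow_logits(context_ids, emb, out, query, key_proj)

# With a constant gate (zero key projection) the MoE variant reduces to CBOW.
flat_logits, flat_w = moe_cbow_logits(context_ids, emb, out, query, np.zeros((d, d)))
```

In this sketch the only difference between the two predictors is how the context weights are formed. Setting the gate to a constant recovers the equal-weight CBOW average, which is one way to read the abstract's statement that attention adds MoE weights on top of a CBOW model.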

Supplementary Material: pdf

Other Supplementary Material: zip

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/bidirectional-attention-as-a-mixture-of/code)
