Lost or Liberated? A Dive into Bidirectional Transformer LMs Without Positional Encoding

Published: 19 Mar 2024, Last Modified: 22 Apr 2024 · Tiny Papers @ ICLR 2024 · CC BY 4.0
Keywords: Transformers, Positional encoding
TL;DR: Transformer encoder (BERT-like) without positional encoding.
Abstract: Recent studies have shown that autoregressive Transformer language models (LMs) can generate text sequences without relying on positional encodings (PEs). This capability is attributed to the causal masks in these models, which prevent tokens from accessing information from future tokens and thereby allow token positions to be learned implicitly. Bidirectional LMs such as BERT, on the other hand, tend to underperform on masked language modeling when PEs are omitted. This performance drop arises because transformer layers are inherently permutation equivariant; without PEs, they cannot differentiate token positions, making bidirectional processing difficult. In this study, we examine a variant of the bidirectional Transformer LM that operates without PEs but incorporates causal masks in its initial layers. Our findings reveal that this configuration yields masked language modeling losses comparable to those of traditional transformers that use PEs. However, when tested on the GLUE language understanding benchmark, the model without PEs exhibits diminished performance. These results highlight the importance of positional encodings in bidirectional LMs and indicate that pretraining loss does not always correlate with performance on downstream tasks.
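To make the studied variant concrete, below is a minimal PyTorch sketch of a BERT-style encoder with no positional embeddings, where only the first few layers use a causal attention mask and the remaining layers attend bidirectionally. This is an illustration under assumed hyperparameters, not the authors' implementation; the class name `HybridMaskEncoder` and parameters such as `num_causal_layers` are hypothetical.

```python
import torch
import torch.nn as nn


class HybridMaskEncoder(nn.Module):
    """Bidirectional Transformer encoder without positional encodings.

    The first `num_causal_layers` layers apply a causal attention mask
    (as described in the abstract); later layers attend bidirectionally.
    All sizes are illustrative placeholders.
    """

    def __init__(self, vocab_size, d_model=256, nhead=4,
                 num_layers=6, num_causal_layers=2):
        super().__init__()
        # Token embeddings only -- deliberately no positional embedding table.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        self.num_causal_layers = num_causal_layers
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        # Float mask with -inf above the diagonal (blocks attention to future tokens).
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(input_ids.device)
        x = self.embed(input_ids)
        for i, layer in enumerate(self.layers):
            # Causal mask only in the initial layers; later layers see the full sequence.
            mask = causal_mask if i < self.num_causal_layers else None
            x = layer(x, src_mask=mask)
        return self.lm_head(x)  # per-token logits for masked-token prediction


# Example usage (toy vocabulary size and sequence length):
if __name__ == "__main__":
    model = HybridMaskEncoder(vocab_size=30522)
    input_ids = torch.randint(0, 30522, (2, 16))
    logits = model(input_ids)  # shape: (2, 16, 30522)
```

The intuition captured here is that the causal early layers break permutation equivariance, letting later bidirectional layers distinguish positions even though no explicit PE is added.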
Submission Number: 24