Masked Language Model with ALiBi and CLAP head

Published: 16 Feb 2024, Last Modified: 28 Mar 2024BT@ICLR2024EveryoneRevisionsBibTeXCC BY 4.0
Keywords: alibi, positional encoding, masked language model, BERT, RoBERTa
Blogpost Url:
Abstract: As a new approach to positional encoding, Attention with Linear Biases (ALiBi) uses linear biases of the attention weights to encode positional information, with capability of context length extrapolation. In their paper however, Press et al. focus on the perplexity of autoregressive decoder-only language models, leaving the question of downstream tasks and its applicability to encoder-attention open. In this blogpost, we attempt to bridge the gap by testing masked language models (MLMs) with encoder-attention ALiBi and prediction head similar to the counterparts of the original ALiBi models. We find that while simplified prediction head may be beneficial, performance of MLMs with encoder-attention ALiBi starts to deteriorate with 2048 sequence length at larger scales. We put our results in the context of related recent experiments and tentatively identify the circumstances more challenging to positional encoding designs. Finally, we open-source our MLMs, with BERT-level performance and 2048 context length.
Ref Papers:
Id Of The Authors Of The Papers: ~Ofir_Press1, ~Noah_A._Smith2, ~Mike_Lewis1
Conflict Of Interest: I declare no conflict of interest with the papers cited by this blogpost.
Submission Number: 8