Abstract: Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome this challenge. In this paper, we study the length generalization property of NoPE. We find that NoPE can extend to longer sequences than the commonly used explicit position encodings. Moreover, we propose a parameter-efficient tuning method that searches for the best temperature hyper-parameter of each attention head, which further expands NoPE's context size. Experiments on long-sequence language modeling and the synthetic passkey retrieval task show that NoPE can achieve performance competitive with state-of-the-art length generalization algorithms.
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: Model analysis & interpretability
Languages Studied: English
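
The abstract refers to causal attention without position encodings and to a per-head temperature hyper-parameter. The following is a minimal sketch of that idea, not the paper's implementation: it assumes the temperature acts as a multiplicative factor (here the hypothetical argument `scale`) on the usual scaled dot-product logits, and the function names are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_nope(q, k, v, scale=1.0):
    """Single-head causal attention with no position encoding.

    q, k, v: lists of equal length, one vector (list of floats) per token.
    scale:   hypothetical per-head factor applied to the logits
             (an assumption about how the temperature enters).
    Returns (outputs, weights); weights[i] covers tokens 0..i only.
    """
    d = len(q[0])
    outputs, weights = [], []
    for i in range(len(q)):
        # The causal mask is the only source of order information:
        # token i attends to tokens 0..i and nothing later.
        logits = [
            scale * sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(w[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        )
    return outputs, weights
```

Under this assumption, a larger `scale` concentrates each row of attention weights on the highest-scoring tokens, and a smaller one spreads the weights more evenly; a search over such a value per head is the kind of tuning the abstract describes.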