Length Generalization of Causal Transformers without Position Encoding

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone

Abstract: Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) offers a new way to address this challenge. In this paper, we study the length generalization property of NoPE. We find that NoPE can extend to longer sequences than commonly used explicit position encodings. Moreover, we propose a parameter-efficient tuning method that searches for the attention heads' best temperature hyper-parameters, which further expands NoPE's context size. Experiments on long-sequence language modeling and the synthetic passkey retrieval task show that NoPE can achieve performance competitive with state-of-the-art length generalization algorithms.
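To make the idea of attention-head temperatures concrete, the following is a minimal NumPy sketch of causal attention with no position encoding, in which each head's logits are multiplied by its own scalar. It is an illustration only and not the paper's implementation: the abstract does not say how the temperature enters the computation or how it is tuned, so the function name `causal_attention_nope`, the `head_scale` parameter, and the choice of one multiplicative scalar per head are all assumptions.

```python
import numpy as np

def causal_attention_nope(q, k, v, head_scale):
    """Causal softmax attention with no position encoding (illustrative).

    q, k, v: arrays of shape (heads, seq_len, head_dim).
    head_scale: array of shape (heads,); a hypothetical per-head
        multiplier on the attention logits (an assumed form of the
        "temperature" mentioned in the abstract).
    Returns (output, weights).
    """
    heads, seq_len, head_dim = q.shape
    # Standard scaled dot-product logits; no positional term is added.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Assumption: each head rescales its logits by its own scalar.
    logits = logits * head_scale[:, None, None]
    # Causal mask: position i may only attend to positions j <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    # Numerically stable softmax over the key axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Example: two heads, four tokens, head dimension eight.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4, 8))
k = rng.normal(size=(2, 4, 8))
v = rng.normal(size=(2, 4, 8))
out, w = causal_attention_nope(q, k, v, head_scale=np.array([1.0, 1.5]))
```

In this sketch only `head_scale` would be tuned, one scalar per head, which is one way a search over temperatures could stay parameter-efficient. A larger scalar sharpens a head's attention distribution, and a scalar of zero makes it uniform over the visible prefix.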

Paper Type: long

Research Area: Efficient/Low-Resource Methods for NLP

Contribution Types: Model analysis & interpretability

Languages Studied: English
