KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Ta-Chung Chi; Ting-Han Fan; Peter Ramadge; Alexander Rudnicky


Published: 31 Oct 2022, Last Modified: 06 Apr 2025, NeurIPS 2022 Accept, Readers: Everyone

Keywords: Transformer Language Modeling, Length Extrapolation, Kernel Method

Abstract: Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative positional embeddings for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. To maintain the inner product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed in the Softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets. Our implementation and pretrained checkpoints are released at https://github.com/chijames/KERPLE.git.
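
The claim that a constant offset is absorbed by the Softmax normalization can be checked numerically. The sketch below is illustrative only and is not the released implementation: the abstract does not state the formula of the logarithmic variant, so the bias form `-r1 * log(1 + r2 * |m - n|)`, the function names, and the parameter values `r1`, `r2` are all assumptions made for this example.

```python
import numpy as np

def log_bias(seq_len, r1=1.0, r2=1.0):
    # ASSUMED form of a logarithmic relative-position bias (not taken from the
    # abstract): -r1 * log(1 + r2 * |m - n|) with r1, r2 > 0.
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -r1 * np.log1p(r2 * dist)

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(q, k, bias):
    # Scaled dot-product logits plus an additive positional bias,
    # with future positions masked out before normalization.
    seq_len, d = q.shape
    logits = q @ k.T / np.sqrt(d) + bias
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    return softmax(logits)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
bias = log_bias(6, r1=1.0, r2=0.5)

w = causal_attention_weights(q, k, bias)
# Adding any constant to the bias leaves the attention weights unchanged.
w_shifted = causal_attention_weights(q, k, bias + 3.7)
print(np.allclose(w, w_shifted))
```

Because Softmax is invariant to adding the same constant to every logit in a row, the offset that would turn a CPD kernel into a PD kernel never has to be computed explicitly.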

Supplementary Material: pdf

TL;DR: We show that conditionally positive definite (CPD) kernels allow us to derive various relative positional embeddings (RPEs) with superior length-extrapolation performance in transformer language modeling.

Community Implementations: [4 code implementations](https://www.catalyzex.com/paper/kerple-kernelized-relative-positional/code)

