Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, NeurIPS 2024 poster, CC BY 4.0
Keywords: Length Generalization, Transformers, Position Coupling, Positional Encoding, Out-of-distribution Generalization, Arithmetic Tasks, Algorithmic Tasks
TL;DR: To tackle the length generalization problem of decoder-only Transformers on arithmetic/algorithmic tasks, we inject the structure of the task into the Transformer by using the same position IDs for relevant tokens.
Abstract: Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose *position coupling*, a simple yet effective method that directly embeds the structure of the task into the positional encoding of a (decoder-only) Transformer. Departing from the vanilla absolute position mechanism, which assigns a unique position ID to each token, we assign the same position ID to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as occupying the same position. On the empirical side, we show that with the proposed position coupling, our models trained on 1 to 30-digit additions can generalize up to *200-digit* additions (6.67× the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task. Our codebase is available at [github.com/HanseulJo/position-coupling](https://github.com/HanseulJo/position-coupling).
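To make the coupling idea concrete, below is a minimal Python sketch of how position IDs *could* be assigned for an addition prompt so that digits of the same significance share an ID. This is an illustration, not the authors' implementation (see the linked codebase for that): the `coupled_position_ids` helper, the `A+B=C` prompt format, the starting offset, and the IDs given to `+` and `=` are all assumptions made here for exposition.

```python
# Minimal sketch (not the paper's exact scheme) of position coupling for addition.
# Assumption: the prompt looks like "A+B=C", one character per token, and digits
# of equal significance in A, B, and C should receive the same position ID.

def coupled_position_ids(prompt: str, start: int = 1) -> list[int]:
    """Assign one position ID per character of an 'A+B=C' prompt,
    coupling digits of equal significance across the three numbers."""
    a, rest = prompt.split("+")
    b, c = rest.split("=")

    width = max(len(a), len(b), len(c))  # highest significance that occurs

    def digit_ids(number: str) -> list[int]:
        # The i-th character (left to right) has significance len(number)-1-i;
        # map it to the shared ID  start + width - significance.
        return [start + width - (len(number) - 1 - i) for i in range(len(number))]

    # Illustrative choice: give '+' and '=' the leading ID; the paper's actual
    # assignment for operator tokens and the random starting offset differ.
    plus_id = equals_id = start
    return digit_ids(a) + [plus_id] + digit_ids(b) + [equals_id] + digit_ids(c)


if __name__ == "__main__":
    prompt = "653+49=702"
    for tok, pid in zip(prompt, coupled_position_ids(prompt)):
        print(tok, pid)
    # '3', '9', and '2' (ones place) share one ID; '5', '4', and '0' (tens
    # place) share another; '6' and '7' (hundreds place) share a third.
```

The point of the sketch is only the coupling rule itself: tokens that play the same role in the task (here, same digit significance) are forced to share a position ID, rather than each receiving a unique absolute position.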
Primary Area: Natural language processing
Submission Number: 13307