TL;DR: We enhance the position-based addressing ability of large language models by enriching positional encodings with contextual information under equivariance constraints.
Abstract: Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con**T**extualized equivari**A**nt **P**osition **E**ncoding (**TAPE**), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
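To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' implementation; the module name `ContextualEquivariantPE` and its interface are illustrative assumptions) of a contextualized positional-encoding update that respects the two equivariance constraints described in the abstract: per-token positional features are mixed across tokens by content-derived attention (permutation equivariant) and rescaled by a content-derived scalar gate, so right-multiplying the positional features by any orthogonal matrix commutes with the update.

```python
# Minimal sketch of a contextualized, equivariant positional-encoding update.
# Assumptions: positional features P have shape (batch, n, r) per token; content x
# has shape (batch, n, d_model). Not the TAPE reference implementation.
import torch
import torch.nn as nn


class ContextualEquivariantPE(nn.Module):
    def __init__(self, d_model: int, r: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)  # scalar gate per token, invariant to O(r)

    def forward(self, x: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
        # Content-derived attention: permutation equivariant in the token dimension.
        scores = self.q_proj(x) @ self.k_proj(x).transpose(-2, -1) / x.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)          # (batch, n, n)
        g = torch.sigmoid(self.gate(x))               # (batch, n, 1)
        # attn @ (P @ Q) == (attn @ P) @ Q and row-wise gating also commutes with Q,
        # so the update is equivariant to orthogonal transforms of the r-dim features.
        return g * (attn @ P)                         # updated positional features
```

This sketch uses full (unmasked) attention for clarity; a causal mask, multi-head structure, and the specific update rule used in TAPE are left to the paper and the linked code.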
Lay Summary: Transformers (the AI behind ChatGPT) struggle with tasks requiring precise "positional awareness", such as solving math problems or finding specific details in long documents. This happens because their current methods for tracking word positions create rigid patterns, like forcing nearby words to always interact more strongly. This limits their ability to handle long texts or adapt to different tasks needing flexible positional understanding.
We developed TAPE (conTextualized equivariAnt Position Encoding), a smarter way to encode position. Unlike fixed methods, TAPE dynamically adjusts how positions are represented based on the actual content of the text, updating them layer by layer as the model processes it. Crucially, it uses mathematical principles ("equivariance") to ensure these position updates stay stable and maintain correct relationships between words, even when the sequence order changes.
TAPE significantly boosts performance on tasks heavily reliant on position, like long addition (21.6% higher accuracy than the previous best method) and retrieving hidden information from very long texts (near-perfect accuracy up to about 6,000 words). It also improves general language modeling for long contexts and can be easily added to existing models like Llama 2 with minimal extra cost. This makes transformers more capable and efficient for complex reasoning and long-context understanding.
Link To Code: https://github.com/VITA-Group/TAPE
Primary Area: Deep Learning->Large Language Models
Keywords: Positional Encoding, Large Language Models, Equivariant Machine Learning
Submission Number: 5294