Transformer Working Memory Enables Regular Language Reasoning And Natural Language Length Extrapolation

Ta-Chung Chi; Ting-Han Fan; Alexander Rudnicky; Peter Ramadge

Transformer Working Memory Enables Regular Language Reasoning And Natural Language Length Extrapolation

Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, Peter Ramadge

Published: 07 Oct 2023, Last Modified: 01 Dec 2023EMNLP 2023 FindingsEveryoneRevisionsBibTeX

Submission Type: Regular Long Paper

Submission Track: Language Modeling and Analysis of Language Models

Submission Track 2: Machine Learning for NLP

Keywords: Transformer, Algorithmic reasoning, Length extrapolation, Working memory

TL;DR: Transformer Working Memory Enables Regular Language Reasoning And Natural Language Length Extrapolation

Abstract: Unlike recurrent models, conventional wisdom has it that Transformers cannot perfectly model regular languages. Inspired by the notion of working memory, we propose a new Transformer variant named RegularGPT. With its novel combination of Weight-Sharing, Adaptive-Depth, and Sliding-Dilated-Attention, RegularGPT constructs working memory along the depth dimension, thereby enabling efficient and successful modeling of regular languages such as PARITY. We further test RegularGPT on the task of natural language length extrapolation and surprisingly find that it rediscovers the local windowed attention effect deemed necessary in prior work for length extrapolation.

Submission Number: 410

Loading