Keywords: transformer, finite state machine, interpretability
TL;DR: We investigate transformers trained on regular languages from a mechanistic interpretability perspective and extract Moore machines from them.
Abstract: Fueled by the popularity of the transformer architecture in deep learning, several works have investigated which formal languages a transformer can learn. Nonetheless, existing results remain hard to compare, and a fine-grained understanding of the trainability of transformers on regular languages is still lacking. We investigate transformers trained on regular languages from a mechanistic interpretability perspective. Using an extension of the $L^*$ algorithm, we extract Moore machines from transformers. We empirically establish tighter lower bounds on the trainability of transformers for languages in which the state is determined by a finite number of tokens. Additionally, our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation. However, we also identify failure cases in which the determining tokens are misrecognised due to saturation of the attention mechanism.
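The sketch below is a minimal, simplified illustration (not the authors' implementation) of the kind of $L^*$-style extraction the abstract describes: an observation table is filled via output queries and then closed into a hypothesis Moore machine. In the paper's setting, the output query would be answered by the trained transformer's prediction; here a toy parity oracle stands in for it, and names such as `output_query` and `learn_moore` are illustrative only. Counterexample handling and equivalence queries are omitted.

```python
# Minimal sketch of L*-style Moore machine extraction (illustrative; assumes a
# stand-in oracle in place of a trained transformer).
from itertools import product

ALPHABET = ["a", "b"]

def output_query(word):
    # Placeholder oracle: Moore output = parity of 'a's seen so far.
    # In the paper's setting this would be the transformer's predicted label.
    return sum(1 for c in word if c == "a") % 2

def row(prefix, suffixes):
    # Row of the observation table: oracle outputs for prefix . suffix.
    return tuple(output_query(prefix + s) for s in suffixes)

def learn_moore(max_rounds=10):
    prefixes = [""]   # access strings (candidate states)
    suffixes = [""]   # distinguishing suffixes (fixed here for simplicity)
    for _ in range(max_rounds):
        # Closure check: every one-letter extension must match a known row.
        rows = {row(p, suffixes): p for p in prefixes}
        new_prefix = None
        for p, a in product(prefixes, ALPHABET):
            if row(p + a, suffixes) not in rows:
                new_prefix = p + a
                break
        if new_prefix is None:
            break
        prefixes.append(new_prefix)
    # Read the hypothesis Moore machine off the closed table.
    rows = {row(p, suffixes): p for p in prefixes}
    states = {r: i for i, r in enumerate(rows)}
    transitions = {
        (states[row(p, suffixes)], a): states[row(p + a, suffixes)]
        for p in prefixes for a in ALPHABET
    }
    outputs = {states[r]: output_query(p) for r, p in rows.items()}
    return states, transitions, outputs

if __name__ == "__main__":
    states, transitions, outputs = learn_moore()
    print("states:", len(states))          # 2 states for the parity oracle
    print("transitions:", transitions)
    print("outputs:", outputs)
```

With the parity oracle, the loop converges to a two-state Moore machine; swapping `output_query` for a transformer's state prediction is what turns this into the extraction procedure the abstract refers to.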
Submission Number: 131