Keywords: transformer, finite state machine, interpretability
TL;DR: We investigate transformers trained on regular languages from a mechanistic interpretability perspective and extract Moore machines from them.
Abstract: Fueled by the popularity of the transformer architecture in deep learning, several works have investigated which formal languages a transformer can learn. Nonetheless, existing results remain hard to compare, and a fine-grained understanding of the trainability of transformers on regular languages is still lacking. We investigate transformers trained on regular languages from a mechanistic interpretability perspective. Using an extension of the $L^*$ algorithm, we extract Moore machines from transformers. We empirically establish tighter lower bounds on the trainability of transformers for languages in which the state is determined by a finite number of tokens. Additionally, our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation. However, we also identify failure cases in which the determining tokens are misrecognised due to saturation of the attention mechanism.
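The sketch below is a minimal, simplified illustration (not the authors' implementation) of the kind of $L^*$-style extraction the abstract describes: an observation table is filled via output queries and then closed into a hypothesis Moore machine. In the paper's setting, the output query would be answered by the trained transformer's prediction; here a toy parity oracle stands in for it, and names such as `output_query` and `learn_moore` are illustrative only. Counterexample handling and equivalence queries are omitted.

```python
# Minimal sketch of L*-style Moore machine extraction (illustrative; assumes a
# stand-in oracle in place of a trained transformer).
from itertools import product

ALPHABET = ["a", "b"]

def output_query(word):
    # Placeholder oracle: Moore output = parity of 'a's seen so far.
    # In the paper's setting this would be the transformer's predicted label.
    return sum(1 for c in word if c == "a") % 2

def row(prefix, suffixes):
    # Row of the observation table: oracle outputs for prefix . suffix.
    return tuple(output_query(prefix + s) for s in suffixes)

def learn_moore(max_rounds=10):
    prefixes = [""]   # access strings (candidate states)
    suffixes = [""]   # distinguishing suffixes (fixed here for simplicity)
    for _ in range(max_rounds):
        # Closure check: every one-letter extension must match a known row.
        rows = {row(p, suffixes): p for p in prefixes}
        new_prefix = None
        for p, a in product(prefixes, ALPHABET):
            if row(p + a, suffixes) not in rows:
                new_prefix = p + a
                break
        if new_prefix is None:
            break
        prefixes.append(new_prefix)
    # Read the hypothesis Moore machine off the closed table.
    rows = {row(p, suffixes): p for p in prefixes}
    states = {r: i for i, r in enumerate(rows)}
    transitions = {
        (states[row(p, suffixes)], a): states[row(p + a, suffixes)]
        for p in prefixes for a in ALPHABET
    }
    outputs = {states[r]: output_query(p) for r, p in rows.items()}
    return states, transitions, outputs

if __name__ == "__main__":
    states, transitions, outputs = learn_moore()
    print("states:", len(states))          # 2 states for the parity oracle
    print("transitions:", transitions)
    print("outputs:", outputs)
```

With the parity oracle, the loop converges to a two-state Moore machine; swapping `output_query` for a transformer's state prediction is what turns this into the extraction procedure the abstract refers to.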
Submission Number: 131