Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks

Published: 21 Oct 2022, Last Modified: 05 May 2023 · Attention Workshop, NeurIPS 2022 Poster
Keywords: transformers, interpretability, multi-task learning, systematic generalization
TL;DR: We present a causal transformer that learns structured, algorithmic tasks and generalizes to longer sequences, and unpack its computation in relation to task structures by analyzing attention patterns and latent representations.
Abstract: Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next-word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We searched for the minimal layer and head configuration sufficient to solve these tasks, ablated attention heads, and analyzed the encoded representations. We show that two-layer transformers learn generalizable solutions to multi-level problems, develop signs of systematic task decomposition, and exploit shared computation across related tasks. These results provide key insights into the possible structures of within-task and cross-task computations that stacks of attention layers can afford.
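The label-based positional scheme described in the abstract can be sketched as follows. This is a minimal illustration of one plausible data format, not the paper's exact encoding: it assumes position labels are sampled at random from a range larger than any training sequence length and then sorted, so that relative order is preserved while no absolute position is ever fixed. The function name `make_copy_example` and the `label_range` parameter are hypothetical.

```python
import random

def make_copy_example(tokens, label_range=50, seed=None):
    """Pair each token with a random position label (a sketch of the
    abstract's 'labels arbitrarily paired with items in the sequence').

    Labels are sampled from a range larger than any training sequence
    length, so a model trained on such data cannot tie its computation
    to absolute positions 0..n-1, which is what permits generalization
    to longer sequences at test time.
    """
    rng = random.Random(seed)
    # Distinct labels, sorted so relative order is recoverable from them.
    labels = sorted(rng.sample(range(label_range), len(tokens)))
    inputs = list(zip(labels, tokens))
    targets = list(tokens)  # copy task: the target is the sequence itself
    return inputs, targets

inputs, targets = make_copy_example(list("abcd"), seed=0)
```

At test time, a longer sequence simply draws more labels from the same range, so the input distribution over individual (label, token) pairs matches training even though the sequence length does not.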