On the Role of Unstructured Training Data in Transformers' In-Context Learning Capabilities

Kevin Christian Wibisono; Yixin Wang

On the Role of Unstructured Training Data in Transformers' In-Context Learning Capabilities

Kevin Christian Wibisono, Yixin Wang

Published: 07 Nov 2023, Last Modified: 13 Dec 2023M3L 2023 PosterEveryoneRevisionsBibTeX

Keywords: in-context learning, attention mechanism, softmax attention, linear attention, mixture of experts, transformers

TL;DR: This paper explores how transformers learn in-context when trained on unstructured data without known input-output pairings

Abstract: Transformers have exhibited impressive in-context learning (ICL) capabilities: they can generate predictions for new query inputs based on sequences of inputs and outputs (i.e., prompts) without parameter updates. Efforts to provide theoretical explanations for the emergence of these abilities have primarily focused on the structured data setting, where input-output pairings in the training data are known. This scenario can enable simplified transformers (e.g., ones comprising a single attention layer without the softmax activation) to achieve notable ICL performance. However, transformers are primarily trained on unstructured data that rarely include such input-output pairings. To better understand how ICL emerges, we propose to study transformers that are trained on unstructured data, namely data that lack prior knowledge of input-output pairings. This new setting elucidates the pivotal role of softmax attention in the robust ICL abilities of transformers, particularly those with a single attention layer. We posit that the significance of the softmax activation partially stems from the equivalence of softmax-based attention models with mixtures of experts, facilitating the implicit inference of input-output pairings in the test prompts. Additionally, a probing analysis reveals where these pairings are learned within the model. While subsequent layers predictably encode more information about these pairings, we find that even the first attention layer contains a significant amount of pairing information.

Submission Number: 77

Loading