Keywords: in-context learning; in-weight learning; dual-space; model architecture
TL;DR: We propose CoQE, a Transformer with dual-space context-query encoding that reconciles ICL and IWL and achieves lower ICL error than standard Transformers across diverse tasks and data distributions.
Abstract: In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, prior studies have observed that ICL often conflicts with the model’s inherent in-weight learning (IWL) capability. In this work, we aim to reconcile ICL and IWL by disentangling the model’s encoding spaces for context and input samples. To do so, we first propose a dual-space modeling framework that explicitly models a task representation space as the dual space of the sample representation space. Such a dual-space structure can be derived from the linear representation hypothesis and, as we theoretically prove, is conducive to ICL via representation learning. Furthermore, we show that the standard Transformer architecture with softmax self-attention is inherently limited in realizing this structure. Building on this insight, we introduce CoQE, a Transformer architecture with separate context-query encoding, to realize the disentanglement between context and sample representations. Through experiments on both regression and classification tasks, we demonstrate that CoQE not only achieves lower ICL error than standard Transformers, but also successfully reconciles ICL and IWL under diverse data distributions.
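To make the "separate context-query encoding" idea concrete, below is a minimal, hypothetical sketch of a dual-encoder in-context learner: the demonstration pairs are encoded into a task representation by one module, the query is embedded into a separate sample-representation space, and a bilinear readout couples the two. This is not the authors' CoQE implementation; all module names, dimensions, pooling, and readout choices are assumptions for illustration only.

```python
# Hypothetical sketch (not the authors' code): separating context encoding from
# query encoding for an in-context regression task. All design choices here are
# assumptions made for illustration.
import torch
import torch.nn as nn


class DualSpaceICLModel(nn.Module):
    def __init__(self, x_dim: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Context encoder: attends over (x, y) demonstration pairs to form a task representation.
        self.ctx_embed = nn.Linear(x_dim + 1, d_model)
        ctx_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.ctx_encoder = nn.TransformerEncoder(ctx_layer, n_layers)
        # Query encoder: embeds the test input in a separate sample-representation space.
        self.qry_embed = nn.Linear(x_dim, d_model)
        # Readout: bilinear interaction between the task and sample representations.
        self.readout = nn.Bilinear(d_model, d_model, 1)

    def forward(self, ctx_x, ctx_y, qry_x):
        # ctx_x: (B, K, x_dim), ctx_y: (B, K, 1), qry_x: (B, x_dim)
        ctx_tokens = self.ctx_embed(torch.cat([ctx_x, ctx_y], dim=-1))
        task_repr = self.ctx_encoder(ctx_tokens).mean(dim=1)  # pooled task representation
        sample_repr = self.qry_embed(qry_x)                   # query / sample representation
        return self.readout(task_repr, sample_repr).squeeze(-1)


if __name__ == "__main__":
    model = DualSpaceICLModel(x_dim=8)
    ctx_x, ctx_y, qry_x = torch.randn(2, 16, 8), torch.randn(2, 16, 1), torch.randn(2, 8)
    print(model(ctx_x, ctx_y, qry_x).shape)  # torch.Size([2])
```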
Supplementary Material: zip
Primary Area: transfer learning, meta learning, and lifelong learning
Submission Number: 1915