Transformers Learn to Compress Variable-order Markov Chains in-Context

Published: 09 Oct 2024, Last Modified: 19 Nov 2024 · Compression Workshop @ NeurIPS 2024 · CC BY 4.0
Keywords: In-context learning; Variable-order Markov chain; Context Tree Weighting
TL;DR: We study how trained Transformers perform in-context learning for variable-order Markov chain sources, from both empirical and theoretical perspectives, by drawing connections to optimal lossless compression algorithms from the information theory literature.
Abstract: In recent years, large language models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities. However, it is still unclear how the underlying transformers accomplish this, especially in more complex scenarios. Toward this goal, several recent works studied how transformers learn fixed-order Markov chains (FOMC) in context, yet natural languages are more suitably modeled by variable-order Markov chains (VOMC), i.e., context trees (CTs). In this work, we study the ICL of VOMCs by viewing language modeling as a form of data compression and focusing on small alphabets and low-order VOMCs. This perspective allows us to leverage mature compression algorithms, such as the context-tree weighting (CTW) algorithm, which is Bayesian optimal for a class of priors, as a baseline. We empirically observe that the performance of transformers is not very sensitive to the number of layers, and even a two-layer transformer can learn in context quite well, closely tracking the performance of CTW. We provide a construction with $D+2$ layers that can accurately mimic the CTW algorithm for VOMCs of maximum order $D$. One distinction from the FOMC setting is that a counting mechanism plays an important role here.
Submission Number: 77
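For readers unfamiliar with the CTW baseline referenced in the abstract, the following is a minimal sketch of sequential context-tree weighting for a binary alphabet with the standard Krichevsky–Trofimov (KT) estimator. The `CTW`/`CTWNode` class names and the zero-padding of the initial context are illustrative assumptions for this sketch, not the paper's implementation; it only shows the kind of per-symbol log-loss (code length) the paper compares transformers against.

```python
from math import log2


class CTWNode:
    """One context node: KT counts plus estimated/weighted log-probabilities."""
    __slots__ = ("a", "b", "log_pe", "log_pw", "children")

    def __init__(self):
        self.a = 0          # number of 0s observed in this context
        self.b = 0          # number of 1s observed in this context
        self.log_pe = 0.0   # log2 of the KT estimator probability of the data seen here
        self.log_pw = 0.0   # log2 of the CTW weighted (mixture) probability
        self.children = {}  # one child per preceding symbol


class CTW:
    """Sequential binary context-tree weighting of maximum order D (illustrative sketch)."""

    def __init__(self, depth):
        self.depth = depth
        self.root = CTWNode()

    def _path(self, context):
        """Nodes from the root down to the depth-D node selected by the context."""
        nodes = [self.root]
        node = self.root
        for sym in context[: self.depth]:
            node = node.children.setdefault(sym, CTWNode())
            nodes.append(node)
        return nodes

    def update(self, symbol, context):
        """Feed one symbol with its context (most recent symbol first);
        return the ideal code length -log2 P(symbol | past) in bits."""
        nodes = self._path(context)
        log_p_before = self.root.log_pw
        # Update bottom-up: KT estimator first, then the CTW mixture.
        for depth in range(len(nodes) - 1, -1, -1):
            node = nodes[depth]
            a, b = node.a, node.b
            # KT sequential estimator: P(next = 1 | a zeros, b ones) = (b + 1/2) / (a + b + 1)
            p = (b + 0.5) / (a + b + 1.0) if symbol == 1 else (a + 0.5) / (a + b + 1.0)
            node.log_pe += log2(p)
            if symbol == 1:
                node.b += 1
            else:
                node.a += 1
            if depth == self.depth:
                node.log_pw = node.log_pe  # maximum-depth node: no further splitting
            else:
                # Mix "this context is a leaf" with "split into longer contexts" (weight 1/2 each).
                log_split = sum(c.log_pw for c in node.children.values())
                node.log_pw = log2(0.5 * 2.0 ** node.log_pe + 0.5 * 2.0 ** log_split)
        return log_p_before - self.root.log_pw


# Example: code length of a short binary sequence under CTW with maximum order D = 3.
if __name__ == "__main__":
    D = 3
    seq = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
    ctw = CTW(depth=D)
    total_bits = 0.0
    for t, x in enumerate(seq):
        context = list(reversed(seq[max(0, t - D):t]))  # most recent past symbol first
        context += [0] * (D - len(context))             # zero-pad the initial short contexts
        total_bits += ctw.update(x, context)
    print(f"CTW code length: {total_bits:.2f} bits for {len(seq)} symbols")
```

The per-symbol return value of `update` is the log loss a VOMC-aware compressor would incur, which is the quantity the abstract's transformer experiments are measured against; the mixture step at internal nodes is what makes CTW a Bayesian average over all context trees up to depth D.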