What and How does In-Context Learning Learn? Bayesian Model Averaging, Parameterization, and Generalization

TMLR Paper 2726 Authors

21 May 2024 (modified: 24 May 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: In-Context Learning (ICL) has proven effective across a wide range of applications, where Large Language Models (LLMs) learn to complete tasks from the examples in the prompt without updating their parameters. In this work, we conduct a comprehensive study to understand ICL from a statistical perspective. First, we show that perfectly pretrained LLMs perform Bayesian Model Averaging (BMA) for ICL under a dynamic model of the examples in the prompt. Building on this BMA view, we establish an average-error analysis of ICL for perfectly pretrained LLMs. Second, we demonstrate how the attention structure facilitates the implementation of BMA. With sufficiently many examples in the prompt, attention is proven to perform BMA under the Gaussian linear ICL model, which also motivates an explicit construction of the hidden concepts from the attention-head values. Finally, we analyze the pretraining behavior of LLMs. The pretraining error is decomposed into a generalization error and an approximation error, which are bounded separately. The ICL average error of the pretrained LLMs is then shown to be the sum of $O(T^{-1})$ and the pretraining error, where $T$ is the number of examples in the prompt. In addition, we analyze the ICL performance of pretrained LLMs with misspecified examples. The theoretical findings are corroborated by experimental results.
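To make the BMA view of ICL concrete, here is a minimal numerical sketch, not the paper's construction: a hidden linear concept is drawn from a finite prior, the prompt examples follow a Gaussian linear model, and the in-context prediction is the posterior-weighted average of the candidate concepts' predictions. All specifics (dimension `d`, prompt length `T`, noise level `sigma`, the candidate concept set) are illustrative assumptions.

```python
# Minimal sketch of Bayesian Model Averaging (BMA) for in-context prediction
# under a toy Gaussian linear model. All parameter choices are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma = 5, 40, 0.5                         # input dim, prompt length, noise std (assumed)

# Finite family of candidate concepts with a uniform prior over them.
concepts = rng.normal(size=(8, d))               # K = 8 candidate linear maps
log_prior = np.full(len(concepts), -np.log(len(concepts)))

theta_star = concepts[3]                         # hidden concept generating the prompt
X = rng.normal(size=(T, d))                      # prompt inputs x_1, ..., x_T
y = X @ theta_star + sigma * rng.normal(size=T)  # labels y_t = <theta*, x_t> + noise

def bma_predict(X_ctx, y_ctx, x_query):
    """BMA prediction: posterior-weighted average of each concept's prediction."""
    resid = y_ctx[:, None] - X_ctx @ concepts.T            # (T, K) residuals per concept
    log_lik = -0.5 * np.sum(resid ** 2, axis=0) / sigma**2 # Gaussian log-likelihoods
    log_post = log_prior + log_lik                          # unnormalized log-posterior
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                                            # posterior over concepts
    return w @ (concepts @ x_query)                         # averaged prediction for y_query

x_query = rng.normal(size=d)
print("BMA in-context prediction:", bma_predict(X, y, x_query))
print("Ground-truth <theta*, x> :", theta_star @ x_query)
```

In this sketch, adding more prompt examples concentrates the posterior weights on the data-generating concept, which is the mechanism behind the $O(T^{-1})$ average-error term discussed in the abstract; the sketch does not reproduce the paper's attention-based construction or its formal rate.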
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Tom_Rainforth1
Submission Number: 2726