Keywords: Transformer, in-context learning, linear regression, temperature
Abstract: Pretrained Transformers exhibit strong in-context learning (ICL) capabilities, enabling them to perform new tasks from a few examples without parameter updates. However, their ICL performance often deteriorates under distribution shifts between pretraining and test-time data. Recent empirical work suggests that adjusting the attention temperature—a scaling factor in the softmax—can improve the performance of Transformers under such distribution shifts, yet its theoretical role remains poorly understood. In this work, we provide the first theoretical analysis of attention temperature in the context of ICL with pretrained Transformers. Focusing on a simplified setting with "linearized softmax" attention, we derive closed-form expressions for the generalization error under distribution shifts. Our analysis reveals that distributional changes in input covariance or label noise can significantly impair ICL, and that there exists an optimal attention temperature that provably minimizes this error. We validate our theory through simulations on linear regression tasks and experiments with LLaMA2-7B on question-answering benchmarks. Our results establish attention temperature as a critical lever for robust in-context learning, offering both theoretical insight and practical guidance for tuning pretrained Transformers under distribution shift.
Supplementary Material: zip
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 28124
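To make the abstract's "attention temperature" and "linearized softmax" terms concrete, here is a minimal, illustrative NumPy sketch (not the paper's construction): scaled dot-product attention with an explicit temperature tau, a first-order linearized-softmax surrogate, and a toy in-context linear regression prompt read out by attending over context labels. The function names, the nearest-neighbour-style readout, and all constants are our own assumptions, not details taken from the submission.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_attention(q, K, V, tau=1.0):
    # Scaled dot-product attention with an explicit temperature tau;
    # tau = 1 recovers the standard 1/sqrt(d) scaling.
    d = K.shape[-1]
    scores = (q @ K.T) / (tau * np.sqrt(d))
    scores -= scores.max()                      # numerical stability
    w = np.exp(scores)
    return (w / w.sum()) @ V

def linearized_attention(q, K, V, tau=1.0):
    # "Linearized softmax": first-order Taylor expansion around uniform
    # weights, softmax(s)_i ~= (1 + s_i - mean(s)) / n.
    n, d = K.shape
    scores = (q @ K.T) / (tau * np.sqrt(d))
    w = (1.0 + scores - scores.mean()) / n
    return w @ V

# Toy ICL linear regression prompt: context pairs (x_i, y_i) with
# y = w^T x + noise, and a query x_q whose label is predicted by
# attending over the context labels.
d, n = 5, 40
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)
x_q = rng.normal(size=d)

# Sweep the temperature to see how it changes the attention readout.
for tau in (0.5, 1.0, 2.0, 4.0):
    pred = softmax_attention(x_q, X, y, tau=tau)
    pred_lin = linearized_attention(x_q, X, y, tau=tau)
    print(f"tau={tau:4.1f}  softmax={pred:+.3f}  "
          f"linearized={pred_lin:+.3f}  true={x_q @ w_true:+.3f}")
```

Sweeping tau in a loop like this mirrors the practical tuning question the abstract raises: under a covariance or label-noise shift between the context distribution and the query, different temperatures trade off how sharply the attention concentrates on a few context examples versus averaging over many of them.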