Keywords: large language model, in-context learning, theoretical deep learning
TL;DR: We theoretically show that smaller language models are more robust to noise, while larger language models are easily distracted.
Abstract: Large language models (LLMs) have emerged as a powerful tool for many AI problems and are deeply involved in many aspects of human activity. One important emergent ability is in-context learning (ICL), where an LLM can perform well on unseen tasks given a brief series of task examples, without any adjustment to the model's parameters. Many works have studied ICL, and one recent, counter-intuitive observation is that language models of different scales can exhibit different ICL behaviors. Despite the tremendous success of ICL, why these behaviors differ remains a mystery. In this work, we aim to answer this question. Given the limited understanding of the ICL mechanism, we study a simplified setting: a one-layer, single-head linear self-attention network pretrained on a linear regression in-context task. We characterize language model scale as the rank of the key and query matrices in attention. We show that smaller language models are more robust to noise, while larger language models are more easily distracted, leading to different ICL behaviors. We also conduct ICL experiments with the LLaMA model family; the results are consistent with prior work and our analysis.
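To make the simplified setting concrete, below is a minimal sketch (not the authors' exact formulation) of a one-layer, single-head linear self-attention model applied to an in-context linear regression prompt, with the rank of the key and query matrices serving as the proxy for model scale. All variable names, the low-rank parameterization, and the random (untrained) weights are illustrative assumptions; in the paper the model is pretrained on the in-context task.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8        # input dimension
N = 16       # number of in-context (x, y) examples
r = 2        # rank of the key/query matrices -- the proxy for "model scale"

# Low-rank key/query parameterization: W_KQ = W_K^T W_Q has rank <= r.
W_K = rng.normal(size=(r, d + 1)) / np.sqrt(d)
W_Q = rng.normal(size=(r, d + 1)) / np.sqrt(d)
W_KQ = W_K.T @ W_Q                          # shape (d+1, d+1), rank <= r

# Value/projection matrix of the single head.
W_PV = rng.normal(size=(d + 1, d + 1)) / np.sqrt(d)

# Build an in-context linear regression prompt: each column stacks [x_i; y_i],
# and the final (query) column carries a zero in its label slot.
w_star = rng.normal(size=d)                 # task-specific regression weights
X = rng.normal(size=(d, N))
y = w_star @ X + 0.1 * rng.normal(size=N)   # noisy in-context labels
x_query = rng.normal(size=d)

Z = np.zeros((d + 1, N + 1))
Z[:d, :N] = X
Z[d, :N] = y
Z[:d, N] = x_query                          # label entry of the query stays 0

# One layer of linear self-attention (no softmax), as is common in
# theoretical treatments of ICL for linear regression.
attn = W_PV @ Z @ (Z.T @ W_KQ @ Z) / N
out = Z + attn

# The model's prediction for the query is read off the label coordinate
# of the last token.
y_query_pred = out[d, N]
print("predicted:", y_query_pred, "true:", w_star @ x_query)
```

In this parameterization, varying `r` changes the expressiveness of the attention head while leaving the prompt and the rest of the architecture fixed, which is the sense in which "scale" is compared across models in the simplified setting.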
Submission Number: 39