Abstract: In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent **task/function vector** in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work has empirically highlighted the key role of Question-Answer (QA) data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded *hierarchical* concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on the cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove convergence of the 0-1 loss and demonstrate strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over their static-embedding predecessors. Empirical simulations corroborate our theoretical insights.
Lay Summary: Large language models (LLMs) can infer the underlying task or function from a few question-answer (QA) demonstration pairs in a prompt and apply it to a new query — a capability known as in-context learning (ICL). But how do they actually do this? Recent work suggests that LLMs form internal “task vectors” representing the function to be performed and, for factual-recall tasks, solve ICL problems using simple vector arithmetic — reminiscent of how Word2Vec operated on static word embeddings.
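To make the vector-arithmetic picture concrete, here is a minimal sketch of our own (not the paper's construction), using hypothetical 2-D embeddings: a task vector is read off a single demonstration pair and added to the query's representation in the residual stream, and the answer is decoded by a nearest-neighbor readout.

```python
# Toy illustration of Word2Vec-style task-vector arithmetic for factual recall.
# All embeddings and names below are hypothetical, chosen only for exposition.
import numpy as np

# Hypothetical static embeddings for a tiny vocabulary.
emb = {
    "France": np.array([1.0, 0.0]),
    "Japan":  np.array([0.0, 1.0]),
    "Paris":  np.array([1.0, 0.9]),
    "Tokyo":  np.array([0.1, 1.9]),
}

# Read a "capital-of" task vector off one in-context demonstration
# (France -> Paris), mimicking how a latent task vector is inferred from context.
theta_task = emb["Paris"] - emb["France"]

# Apply it to a new query by residual addition, then decode by nearest neighbor.
h_out = emb["Japan"] + theta_task
answer = min(emb, key=lambda w: np.linalg.norm(emb[w] - h_out))
print(answer)  # -> "Tokyo" under these toy embeddings
```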
We build on these observations and offer the first mathematical framework that explains how this vector-based mechanism works. Based on recent insights into how transformers represent hierarchical concepts, we show that when trained via gradient descent on question-answer data with the cross-entropy loss, transformers can retrieve task vectors and perform factual recall through vector addition within their residual pathways. Our analysis further reveals a key difference between training-data regimes: while training on ICL-style data often leads to harmful overfitting to low-level, task-specific patterns, QA data helps the model capture the high-level task itself. This provides a theoretical explanation for why QA data has been especially effective in enhancing the factual-recall abilities of LLMs.
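Schematically, and in illustrative notation of our own rather than the paper's formal setup, the trained mechanism amounts to a residual-stream addition followed by a softmax readout fit with cross-entropy:

$$
h_{\mathrm{out}} = h_q + \theta_t, \qquad
p(w \mid \text{prompt}) = \frac{\exp\left(e_w^{\top} h_{\mathrm{out}}\right)}{\sum_{w'} \exp\left(e_{w'}^{\top} h_{\mathrm{out}}\right)}, \qquad
\mathcal{L} = -\log p\left(w^{\star} \mid \text{prompt}\right),
$$

where $h_q$ is the residual-stream state at the query, $\theta_t$ is the latent task vector inferred from the QA demonstrations, $e_w$ is the unembedding vector of token $w$, and $w^{\star}$ is the correct answer.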
Our theory not only explains this emergent behavior but also establishes strong generalization across concept recombination and distribution shifts, outperforming static embedding models. These results serve as preliminary steps toward theoretically justifying emerging practices built on task-vector arithmetic, such as concept erasure, alleviating forgetting, model editing, and model merging.
Primary Area: Theory->Deep Learning
Keywords: Task Vector; Residual Transformer; Factual-recall In-Context Learning;
Submission Number: 15363