Track: long paper (up to 8 pages)
Keywords: large language models, transformers, in-context learning
TL;DR: We argue that in-context learning tasks can be effectively represented by a weighted sum of attention-head activations, with the weights causally optimized via gradient descent on cases where the LLM underperforms.
Abstract: Large language models (LLMs) excel at in-context learning (ICL), adapting to new tasks from example-based prompts without parameter updates. Despite these capabilities, how ICL tasks are internally represented and how they generalize remain elusive. We introduce a method that encodes the task information in an ICL prompt as a single vector embedding, computed as a weighted sum of the transformer's attention-head activations and optimized via gradient descent on cases where performance falls short. Our results indicate that current methods fail to generalize numeric tasks beyond the lengths seen during training, degrading sharply even when that length is only slightly exceeded. Our approach not only addresses these shortcomings but also improves performance across numeric and linguistic tasks while maintaining high task fidelity. This demonstrates the method's efficacy in deriving task-specific information from in-context demonstrations, suggesting broader applications for LLMs in ICL.
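The abstract describes forming a single task embedding as a learnable weighted sum of attention-head activations, fitted by gradient descent while the LLM itself stays frozen. Below is a minimal sketch of that idea in PyTorch; the tensor shapes, the softmax mixing, the MSE objective, and the target vector are illustrative assumptions for demonstration, not the paper's implementation.

```python
import torch

torch.manual_seed(0)

# Assumed dimensions: a GPT-2-sized model and a small batch of ICL prompts.
n_layers, n_heads, d_model = 12, 12, 768
n_prompts = 32

# Assumed: attention-head activations extracted while the frozen LLM processes
# the ICL demonstrations, stacked as [n_prompts, n_layers * n_heads, d_model].
head_activations = torch.randn(n_prompts, n_layers * n_heads, d_model)

# Assumed: a target direction the task vector should match on prompts where the
# frozen LLM underperforms (e.g., derived from corrected outputs).
target = torch.randn(d_model)

# One scalar weight per attention head; only these weights are trained.
weights = torch.zeros(n_layers * n_heads, requires_grad=True)
optimizer = torch.optim.Adam([weights], lr=1e-2)

for step in range(200):
    optimizer.zero_grad()
    # Softmax keeps the combination a convex mixture over heads (an assumption).
    mix = torch.softmax(weights, dim=0)                                # [H]
    # Weighted sum over heads yields one vector per prompt.
    task_vectors = (mix[None, :, None] * head_activations).sum(dim=1)  # [n_prompts, d_model]
    loss = torch.nn.functional.mse_loss(task_vectors, target.expand_as(task_vectors))
    loss.backward()
    optimizer.step()

# Averaging over prompts gives the single task embedding used downstream.
task_embedding = task_vectors.mean(dim=0).detach()
```

In this sketch the gradient only flows into the per-head mixing weights, mirroring the abstract's claim that task information is extracted without updating the model's parameters.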
Submission Number: 31