How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression

Published: 18 Jun 2024, Last Modified: 03 Jul 2024, TF2M 2024 Poster, CC BY 4.0
Keywords: in-context learning, transformers, deep learning theory, learning theory
Abstract: In this study, we investigate how a trained multi-head transformer performs in-context learning on sparse linear regression. We experimentally discover distinct patterns of multi-head utilization across layers: multiple heads are essential in the first layer, while subsequent layers predominantly rely on a single head. We propose that the first layer preprocesses the input data, while later layers execute simple optimization steps on the preprocessed data. Theoretically, we prove that such a preprocess-then-optimize algorithm can outperform naive gradient descent and ridge regression, a result corroborated by our experiments. Our findings provide insights into the benefits of multi-head attention and the intricate mechanisms within trained transformers.
Submission Number: 16
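As a rough illustration of the preprocess-then-optimize idea described in the abstract, the sketch below compares plain gradient descent with gradient descent run after a simple preconditioning (whitening) of the covariates on a synthetic sparse linear regression instance. This is not the paper's algorithm: the whitening step, problem sizes, step counts, and learning rate are all illustrative assumptions standing in for whatever preprocessing the first attention layer implements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sparse linear regression task (hypothetical setup, not the paper's protocol):
# d-dimensional covariates, s-sparse ground-truth weights, n in-context examples.
d, s, n = 32, 4, 24
w_star = np.zeros(d)
w_star[rng.choice(d, s, replace=False)] = rng.normal(size=s)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

def gd(X, y, steps=10, lr=0.05):
    """Plain gradient descent on the least-squares loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def preprocess_then_gd(X, y, steps=10, lr=0.05):
    """Illustrative 'preprocess-then-optimize': whiten the covariates
    (a stand-in for the first layer's preprocessing), then run the same
    gradient-descent steps on the transformed data."""
    cov = X.T @ X / len(y) + 1e-3 * np.eye(X.shape[1])
    P = np.linalg.inv(np.linalg.cholesky(cov)).T  # whitening map: P.T @ cov @ P = I
    w_tilde = gd(X @ P, y, steps, lr)             # optimize in the whitened coordinates
    return P @ w_tilde                            # map back to the original coordinates

for name, w_hat in [("plain GD", gd(X, y)), ("preprocess+GD", preprocess_then_gd(X, y))]:
    print(f"{name:15s} parameter error: {np.linalg.norm(w_hat - w_star):.3f}")
```

Under this toy setup, the preconditioned variant typically converges in fewer steps because the effective loss landscape is better conditioned, which is the intuition (not the proof) behind the preprocess-then-optimize comparison in the abstract.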