Connecting Pre-trained Language Model and Downstream Task via Properties of Representation

Published: 21 Sept 2023, Last Modified: 02 Nov 2023NeurIPS 2023 posterEveryoneRevisionsBibTeX
Keywords: language model representation, downstream performance, deep learning theory
TL;DR: We empirically find a structure in representations from big language models and theoretically proves it helps transferring pre-training performance to downstream task performances.
Abstract: Recently, researchers have found that representations learned by large-scale pre-trained language models are useful in various downstream tasks. However, there is little theoretical understanding of how pre-training performance is related to downstream task performance. In this paper, we analyze how this performance transfer depends on the properties of the downstream task and the structure of the representations. We consider a log-linear model where a word can be predicted from its context through a network having softmax as its last layer. We show that even if the downstream task is highly structured and depends on a simple function of the hidden representation, there are still cases when a low pre-training loss cannot guarantee good performance on the downstream task. On the other hand, we propose and empirically validate the existence of an ``anchor vector'' in the representation space, and show that this assumption, together with properties of the downstream task, guarantees performance transfer.
Supplementary Material: pdf
Submission Number: 9913