TL;DR: We develop the contexture theory, which shows that representations are learned from the association between the input and a context variable.
Abstract: Despite the empirical success of foundation models, we do not have a systematic characterization of the representations that these models learn.
In this paper, we establish the contexture theory.
It shows that a large class of representation learning methods can be characterized as learning from the association between the input and a context variable. Specifically, we show that many popular methods aim to approximate the top-d singular functions of the expectation operator induced by the context, in which case we say that the representation learns the contexture.
We demonstrate the generality of the contexture theory by proving that representation learning in a variety of paradigms -- supervised learning, self-supervised learning, and manifold learning -- can be studied from this perspective.
We prove that representations that learn the contexture are optimal on those tasks that are compatible with the context.
One important implication of our theory is that once the model is large enough to approximate the top singular functions, scaling up the model size yields diminishing returns, so further improvement requires better contexts.
To this end, we study how to evaluate a context without knowledge of the downstream tasks. We propose a metric and show empirically that it correlates well with the actual performance of the encoder on many real datasets.
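To make the central object of the abstract concrete, here is a minimal numerical sketch of the expectation operator and its top-d singular functions for a toy setting in which both the input X and the context variable A take finitely many values. The joint distribution, the choice d = 3, and all variable names are illustrative assumptions rather than the paper's setup; in the finite case the singular functions can be read off from the SVD of the normalized joint-probability matrix.

```python
import numpy as np

# Toy finite-valued input X (6 values) and context variable A (8 values).
# The joint distribution below is random and purely illustrative.
rng = np.random.default_rng(0)
P = rng.random((6, 8))
P /= P.sum()                      # joint distribution p(x, a)

p_x = P.sum(axis=1)               # marginal p(x)
p_a = P.sum(axis=0)               # marginal p(a)

# The expectation operator maps a function g of A to the function
# x -> E[g(A) | X = x]. In the finite case, its singular functions
# correspond to the singular vectors of the normalized matrix
# Q[x, a] = p(x, a) / sqrt(p(x) * p(a)).
Q = P / np.sqrt(np.outer(p_x, p_a))
U, s, _ = np.linalg.svd(Q)

d = 3
# Drop the trivial top singular direction (constant functions, s[0] = 1)
# and rescale to obtain the top-d singular functions evaluated on X.
Phi = U[:, 1:d + 1] / np.sqrt(p_x)[:, None]
print("top singular values:", s[:d + 1])
print("d-dimensional representation of each value of x:\n", Phi)
```

In this toy setting, an encoder "learns the contexture" if its d-dimensional representation spans the same subspace as the columns of Phi above.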
Lay Summary: Foundation models have achieved remarkable empirical success in recent years. Their success largely results from the scaling law -- increasing the model size leads to better performance. However, two questions have not been answered satisfactorily. First, what representations do foundation models learn, and why are these representations useful for a variety of downstream tasks? Second, can increasing the model size always improve performance?
We develop the contexture theory, which shows that foundation models learn representations from the association between the input X and a context variable A. Specifically, they aim to extract the span of the top-d singular functions of a specific operator, the expectation operator induced by the joint distribution of X and A. Such a representation is useful for a downstream task if the task is compatible with the context. The theory implies that increasing the model size brings the representation closer to this top-d subspace, and once the two are close enough, further scaling has little benefit.
Hence, creating better contexts is essential for further improving pretraining. To this end, we study how to evaluate the usefulness of a context, and propose a metric that quantifies it. The metric depends only on the spectrum of the expectation operator and requires no knowledge of the downstream task. We show that this metric correlates well with the actual error of the pretrained encoder.
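The lay summary does not spell out the metric itself, so the sketch below only illustrates the general idea of a spectrum-only quantity: it computes the singular values of the expectation operator for a toy context and reports a simple summary of the spectrum. The statistic shown (the fraction of squared spectral mass in the top-d nontrivial directions) is a hypothetical stand-in for illustration, not the metric proposed in the paper.

```python
import numpy as np

# Toy joint distribution of X and A (illustrative, as in the sketch above).
rng = np.random.default_rng(0)
P = rng.random((6, 8))
P /= P.sum()
p_x, p_a = P.sum(axis=1), P.sum(axis=0)

# Singular values of the expectation operator (via the normalized matrix).
s = np.linalg.svd(P / np.sqrt(np.outer(p_x, p_a)), compute_uv=False)
s = s[1:]                          # drop the trivial singular value s[0] = 1

# A hypothetical spectrum-only summary: the fraction of squared spectral
# mass captured by the top-d nontrivial directions. This is NOT the
# paper's metric, only an example of a quantity that depends solely on
# the spectrum and needs no downstream labels.
d = 3
print("spectrum:", s)
print("top-d spectral mass fraction:", (s[:d] ** 2).sum() / (s ** 2).sum())
```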
Link To Code: https://colab.research.google.com/drive/1GdJ0Yn-PKiKfkZIwUuon3WpTpbNWEtAO?usp=sharing
Primary Area: General Machine Learning->Representation Learning
Keywords: representation learning, pretraining, learning theory, foundation models, scaling law
Submission Number: 5109