Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing

07 May 2026 (modified: 09 May 2026) · ICML 2026 Workshop CoLoRAI Submission · CC BY 4.0
Keywords: representation dimensionality, high-dimensional statistics, generalisation, pretraining, linear probing
TL;DR: A high-dimensional analysis providing precise conditions under which learning compressed versus highly-detailed representations during pretraining aids downstream generalisation.
Abstract: Learning reusable representations from unlabelled data is central to modern training pipelines, where large pretrained models are adapted to downstream tasks through fine-tuning or linear probing. We study this process in a tractable two-stage setting: representations are learned via principal component analysis on unlabelled data and reused for downstream linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error as functions of representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly shape downstream generalisation and that the optimal representation size depends critically on the data regime: with abundant pretraining data but limited downstream supervision, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, our results identify when compression during pretraining improves downstream generalisation.
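For a concrete picture of the two-stage setting, a minimal numerical sketch follows. It is not the paper's code: the data-generating model (low-rank signal plus noise), sample sizes, teacher vector, and noise levels are illustrative assumptions. It only mirrors the pipeline described in the abstract, PCA on unlabelled data followed by a least-squares linear probe on a separate labelled dataset, sweeping the representation dimensionality k.

```python
# Minimal sketch (not the authors' code) of PCA pretraining + linear probing.
# All data-generating choices below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, r, n_unlab, n_lab, n_test = 100, 10, 2000, 50, 1000

# Assumed low-rank-plus-noise structure shared by unlabelled and labelled data.
U = rng.standard_normal((d, r))
def sample(n, labelled=False):
    X = rng.standard_normal((n, r)) @ U.T + 0.5 * rng.standard_normal((n, d))
    if not labelled:
        return X
    y = X @ w_star + 0.1 * rng.standard_normal(n)   # linear teacher labels
    return X, y

X_unlab = sample(n_unlab)
w_star = rng.standard_normal(d) / np.sqrt(d)        # hypothetical teacher vector
X_train, y_train = sample(n_lab, labelled=True)
X_test, y_test = sample(n_test, labelled=True)

# Stage 1: "pretraining" = PCA on unlabelled data (top-k eigenvectors of the
# empirical covariance).
cov = X_unlab.T @ X_unlab / n_unlab
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]

# Stage 2: least-squares linear probe on the k-dimensional representation,
# for several representation sizes k.
for k in (1, 5, 10, 25, 50, 100):
    V = eigvecs[:, order[:k]]                       # learned k-dim representation
    Z_train, Z_test = X_train @ V, X_test @ V
    w_hat, *_ = np.linalg.lstsq(Z_train, y_train, rcond=None)
    test_mse = np.mean((Z_test @ w_hat - y_test) ** 2)
    print(f"k={k:3d}  test MSE={test_mse:.4f}")
```

Varying n_unlab and n_lab in this sketch lets one probe the regimes the abstract contrasts: with few labelled samples, small k (compressed representations) tends to generalise better, while with scarce unlabelled data, larger k can help.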
Submission Number: 89