Towards Understanding Self-Pretraining for Sequence Classification

Published: 10 Jun 2025, Last Modified: 27 Jun 2025, LCFM 2025, CC BY 4.0
Keywords: Attention, LRA, pretraining
TL;DR: Investigation of self-pretraining (SPT) on the Long-Range Arena (LRA)
Abstract: It was recently shown by Amos et al. (2023) that, to boost the test accuracy of transformer models on sequence classification, it can be highly effective to first pretrain with a masked token prediction objective on exactly the same data (self-pretraining, SPT). While the focus of Amos et al. (2023) is to show that transformers – and not only state-space models (SSMs, like S4) – can perform well on the Long-Range Arena (LRA, a collection of challenging synthetic sequence classification tasks), their finding is intriguing from a more fundamental perspective. Indeed, even though the observed gains can easily be attributed to data-driven initialization and the inductive biases of pretraining, it is unclear which precise mechanism unlocks performance and why standard supervised learning can fail. To better understand this intriguing phenomenon, we replicate and ablate the results of Amos et al. (2023). We show that substantial gains can be observed even at an extremely small scale, using a self-pretraining pipeline that requires little extra compute. We further identify the attention weights as the source of SPT's improved performance. We hope our insights motivate future investigations of SPT, and that our work exposes this unusual yet promising technique for data-scarce learning to a broader audience.
Submission Number: 22