Abstract: Masked Latent Semantic Modeling (MLSM) is a pre-training objective that, in contrast to masked language modeling (MLM), replaces the reconstruction of exact word forms with the prediction of their latent semantic properties (LSPs). The LSPs are determined by performing sparse coding on the hidden token representations derived from an auxiliary model. In this paper, we identify and carefully evaluate previously unexplored, important properties of MLSM pre-training. Based on the results of our rigorous experiments, we formulate a series of recommendations and best practices for improving the efficiency of MLSM pre-training. Among other recommendations, we propose a recipe for choosing the layer of the auxiliary model from which the LSPs are determined, such that the costs of MLSM pre-training can be reduced while the downstream fine-tuning capabilities of the resulting model are maintained or even surpassed. We also provide an improved implementation of MLSM, which reduces its computational requirements, expressed in FLOPS, by 33%. Besides the reduced computational requirements, MLSM comes with better fine-tuning transferability, i.e., in our experience, the fine-tuning performance of MLSM pre-trained checkpoints is on par with or better than that of models pre-trained with alternative objectives for twice as many update steps. We release our code for reproducing our experiments at github.com/[MASK]
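To make the target construction concrete, the following minimal sketch illustrates one way the sparse-coding step could be realized: hidden states from the chosen auxiliary layer are encoded against a dictionary learned offline, and the normalized, non-negative codes serve as soft targets for the masked positions. The dictionary `D`, the `lasso_lars` solver, the regularization strength, and the helper names (`mlsm_targets`, `mlsm_loss`) are illustrative assumptions rather than the exact configuration of our released implementation.

```python
# Illustrative sketch only: hyperparameters and helper names are hypothetical.
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.decomposition import sparse_encode


def mlsm_targets(hidden_states: np.ndarray, D: np.ndarray, alpha: float = 0.05) -> torch.Tensor:
    """Turn auxiliary-model hidden states into soft targets over dictionary atoms.

    hidden_states: (num_masked_tokens, hidden_dim) activations from the chosen auxiliary layer
    D:             (num_atoms, hidden_dim) sparse-coding dictionary learned offline
    """
    # Non-negative sparse codes; most entries of each row are zero.
    codes = sparse_encode(hidden_states, D, algorithm="lasso_lars", alpha=alpha, positive=True)
    codes = torch.from_numpy(codes).float()
    # Normalize each row into a distribution over latent semantic properties.
    return codes / codes.sum(dim=-1, keepdim=True).clamp_min(1e-12)


def mlsm_loss(student_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """KL divergence between the student's predicted LSP distribution and the soft targets."""
    return F.kl_div(F.log_softmax(student_logits, dim=-1), targets, reduction="batchmean")
```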
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
- The abstract has been rewritten to address Reviewer Lf8n's remark that the original abstract was not informative enough
- The description of MLSM in §3 has been substantially revised to improve clarity and provide more details, as requested by Reviewer Lf8n
- Based on the comments of Reviewer jaDw, §4.2.1 has been expanded with additional experiments and methodological novelties regarding the selection of the auxiliary model layer to be used
- Additional details on the FLOPS calculation have been added to §4.2.3, as requested by Reviewer Lf8n
- Subsection 4.3.2 has been added to the manuscript with scaled-up experiments (currently still in progress), following the remarks of Reviewer m4ph
- For our multi-task learning setting, we added new results for the $\kappa=2$ case, as requested by Reviewer Lf8n
- We have added a future work section on applying the proposed pre-training approach to vision models, as recommended by Reviewer jaDw (this modification is also meant to address the remark from Reviewer Lf8n asking for motivation as to how our findings might be of interest beyond 'regular' MLSM)
- We have refined the wording; e.g., instead of better sample efficiency, we now refer to improved downstream transferability throughout the manuscript
Submission Number: 4699