Difference-Masking: Choosing What to Mask in Continued Pretraining

Published: 23 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings
Submission Type: Regular Long Paper
Submission Track: Machine Learning for NLP
Submission Track 2: Efficient Methods for NLP
Keywords: Machine Learning, Self-Supervised Learning, Multimodal, NLP
TL;DR: A novel strategy for choosing what to mask during continued pretraining to improve performance on diverse downstream tasks.
Abstract: The self-supervised objective of masked prediction has led to promising performance gains on a variety of downstream tasks. However, while most approaches randomly mask tokens, there is strong intuition that deciding what to mask can substantially improve learning outcomes. We investigate this in the continued pretraining setting, in which pretrained models continue to pretrain on domain-specific data before performing a downstream task. We introduce Difference-Masking, a masking strategy that automatically chooses what to mask during continued pretraining by considering what makes a task domain different from the pretraining domain. Empirically, we find that Difference-Masking outperforms baselines in continued pretraining settings across four diverse language-only and multimodal video tasks.
Submission Number: 2107
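To make the core idea in the abstract concrete, below is a minimal, illustrative sketch of one possible difference-based masking heuristic: score tokens by how much more frequent they are in the target-domain corpus than in the general pretraining corpus, then preferentially mask the most domain-distinctive tokens rather than masking uniformly at random. This is not the paper's implementation; the frequency-ratio scoring and all function and parameter names here are assumptions made purely for illustration.

```python
from collections import Counter

def domain_difference_scores(target_docs, general_docs):
    """Assumed heuristic: score each token by how much more frequent it is
    in the target-domain corpus than in the general pretraining corpus."""
    target_counts = Counter(tok for doc in target_docs for tok in doc)
    general_counts = Counter(tok for doc in general_docs for tok in doc)
    target_total = sum(target_counts.values())
    general_total = sum(general_counts.values())
    scores = {}
    for tok, cnt in target_counts.items():
        p_target = cnt / target_total
        # add-one smoothing so unseen general-domain tokens get finite scores
        p_general = (general_counts.get(tok, 0) + 1) / (general_total + 1)
        scores[tok] = p_target / p_general  # >1 means domain-distinctive
    return scores

def difference_mask(tokens, scores, mask_rate=0.15, mask_token="[MASK]"):
    """Mask the tokens with the highest domain-difference scores,
    instead of selecting mask positions uniformly at random."""
    n_mask = max(1, int(len(tokens) * mask_rate))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: scores.get(tokens[i], 0.0),
                    reverse=True)
    to_mask = set(ranked[:n_mask])
    return [mask_token if i in to_mask else tok
            for i, tok in enumerate(tokens)]

# Toy usage: domain-specific terms are masked before generic ones.
general = [["the", "cat", "sat", "on", "the", "mat"]]
target = [["the", "patient", "received", "antibiotics", "for", "sepsis"]]
scores = domain_difference_scores(target, general)
print(difference_mask(target[0], scores, mask_rate=0.3))
```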