Learning to Improve Out-of-Distribution Generalization via Self-Adaptive Language Masking

Published: 01 Jan 2024 · Last Modified: 14 Jul 2025 · IEEE/ACM Trans. Audio Speech Lang. Process. 2024 · CC BY-SA 4.0
Abstract: Although pre-trained Transformers learn general linguistic knowledge from large-scale corpora, they still overfit to lexical biases when fine-tuned on specific datasets. This problem limits the generalizability of pre-trained models, particularly when learning over out-of-distribution (OOD) data. To address this issue, this paper proposes a self-adaptive language masking (AdaLMask) paradigm for fine-tuning pre-trained Transformers. AdaLMask obviates lexical biases by eliminating the dependence on semantically inessential words. Specifically, AdaLMask learns a Gumbel-Softmax distribution to determine the desired masking positions, and the distribution parameters are optimized via a representation-invariant (RInv) objective to ensure that the masked positions are semantically lossless. Four natural language processing tasks are used to evaluate the effectiveness of the proposed method in terms of robustness to lexical biases and OOD generalization. The empirical results demonstrate that the AdaLMask paradigm substantially improves the OOD generalization of pre-trained Transformers.
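
To illustrate the mechanism the abstract describes, below is a minimal PyTorch sketch of Gumbel-Softmax mask sampling paired with a representation-invariance penalty. This is not the paper's implementation: the class name GumbelMasker, the two-way keep/mask parameterization, the cosine-similarity form of the RInv loss, and the temperature value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn


class GumbelMasker(nn.Module):
    """Samples a per-token mask decision from a learned Gumbel-Softmax distribution
    (a sketch of the masking step described in the abstract)."""

    def __init__(self, hidden_size: int, tau: float = 1.0):
        super().__init__()
        # Two logits per token: index 0 = keep, index 1 = mask (assumed parameterization).
        self.scorer = nn.Linear(hidden_size, 2)
        self.tau = tau

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_size)
        logits = self.scorer(token_states)                       # (batch, seq_len, 2)
        # Straight-through Gumbel-Softmax: discrete decisions in the forward pass,
        # differentiable surrogate gradients in the backward pass.
        decisions = F.gumbel_softmax(logits, tau=self.tau, hard=True)
        return decisions[..., 1]                                 # 1.0 where the token is masked


def representation_invariant_loss(h_orig: torch.Tensor, h_masked: torch.Tensor) -> torch.Tensor:
    """Assumed form of the RInv objective: penalize divergence between the sentence
    representations of the original and the masked input, so that the learned masks
    remain semantically lossless."""
    return 1.0 - F.cosine_similarity(h_orig, h_masked, dim=-1).mean()
```

In practice, the sampled decisions would replace the selected input tokens with the tokenizer's [MASK] id, and the RInv term would be added to the task loss during fine-tuning; how the two terms are weighted is likewise an assumption here rather than a detail given in the abstract.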