Mitigating Forgetting in Adapting Pre-trained Language Models to Text Processing Tasks via Consistency Alignment
Track: Semantics and knowledge
Keywords: Catastrophic Forgetting, Consistency Alignment, Pre-training Knowledge, Task-specific Knowledge, Auxiliary Model
Abstract: There are a large number of text processing tasks in web applications, such as sentiment classification, summary extraction, and question answering. Recently, fine-tuning pre-trained language models (PLMs) to adapt them to downstream text-processing tasks has attracted much attention. However, due to differences in data, models, and tasks between the pre-training and fine-tuning processes, fine-tuning may suffer from catastrophic forgetting of pre-training knowledge, which can implicitly limit the model’s performance and generalization ability. To address these challenges, we propose a novel dual-model framework, termed \emph{co}nsistency \emph{a}l\emph{i}gnment (CoAi). The key insight of CoAi is to build an auxiliary model that simulates the distribution of pre-training knowledge in real time according to the current task, and to co-train the task-specific model and the auxiliary model so as to balance pre-training knowledge and task-specific knowledge during fine-tuning. Specifically, the auxiliary model is constructed on the fly to maintain the pre-training knowledge. CoAi then simulates the pre-training process by performing distributional exploration in the parameter space, building on our novel insight into the transformation between the data space and the model parameter space. However, the objectives used to construct the auxiliary model lead to misalignment between the pre-training and task-specific knowledge. To alleviate this inconsistency, we employ an auxiliary variable to align the prediction distributions of the task-specific and auxiliary models, inspired by contrastive clustering. We validate the effectiveness of CoAi on ten classic classification tasks and three generation tasks, showing consistent and significant improvements over state-of-the-art methods.
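To make the dual-model co-training idea concrete, the following is a minimal, hypothetical sketch of how a consistency-alignment term could be combined with the task loss; it assumes a symmetric KL divergence between the two models' prediction distributions, and all names (TaskModel, consistency_alignment_loss, the 0.1 alignment weight) are illustrative, not the paper's actual implementation.

```python
# Illustrative sketch only: co-train a task-specific model with an auxiliary
# model and align their prediction distributions via a symmetric KL term.
# All class/function names and hyperparameters here are hypothetical.
import torch
import torch.nn.functional as F
from torch import nn


class TaskModel(nn.Module):
    """Stand-in for a fine-tuned PLM with a classification head."""

    def __init__(self, hidden: int = 768, num_labels: int = 2):
        super().__init__()
        self.encoder = nn.Linear(hidden, hidden)  # placeholder for a PLM encoder
        self.head = nn.Linear(hidden, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.tanh(self.encoder(x)))


def consistency_alignment_loss(task_logits, aux_logits, temperature: float = 1.0):
    """Symmetric KL divergence between the two models' prediction distributions."""
    p = F.log_softmax(task_logits / temperature, dim=-1)
    q = F.log_softmax(aux_logits / temperature, dim=-1)
    kl_pq = F.kl_div(p, q.exp(), reduction="batchmean")
    kl_qp = F.kl_div(q, p.exp(), reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)


# Toy co-training step on random features standing in for encoded inputs.
task_model, aux_model = TaskModel(), TaskModel()
optimizer = torch.optim.AdamW(
    list(task_model.parameters()) + list(aux_model.parameters()), lr=2e-5
)
features = torch.randn(8, 768)       # stand-in for a batch of encoded inputs
labels = torch.randint(0, 2, (8,))   # stand-in for task labels

optimizer.zero_grad()
task_logits = task_model(features)
aux_logits = aux_model(features)
loss = F.cross_entropy(task_logits, labels) \
    + 0.1 * consistency_alignment_loss(task_logits, aux_logits)
loss.backward()
optimizer.step()
```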
Submission Number: 2018