Keywords: Natural Language Processing, Probing Study
Abstract: Large pre-trained language models have helped to achieve state of the art on a variety of NLP tasks, nevertheless, they still suffer from forgetting when incrementally learning a series of sequential tasks. To alleviate this problem, recent works propose several models enhanced by sparse experience replay and local adaption, which yield satisfactory performance. However, in this paper we find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay. To verify the ability of BERT to maintain old knowledge, we adopt and re-finetune single-layer probe networks with the parameters of BERT fixed. We investigate the models on two typical kinds of NLP tasks, text classification and extractive question answering. And our experiments reveal that BERT can actually generate high quality representations for previous tasks in a long term, under extremely sparse replay or even no replay. We further introduce a series of methods to interpret the mechanism of forgetting and how memory rehearsal plays a significant role in task incremental learning, which bridges the gap between our new discovery and previous studies about catastrophic forgetting. Additionally, we provide both quantified and visualized results demonstrating that the representation space of BERT is always topologically organised, which guarantees its performance.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)