***FastDoc***: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

TMLR Paper2054 Authors

16 Jan 2024 (modified: 21 Apr 2024), decision pending for TMLR
Abstract: In this paper, we propose ***FastDoc*** (Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy), a novel, compute-efficient framework that uses document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus. The main innovation is that, during domain-specific pre-training, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents), whereas fine-tuning is done with token-level embeddings as inputs to this encoder. We perform such domain-specific pre-training on three domains, namely customer support, scientific, and legal, and compare performance on 6 different downstream tasks and 9 different datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around 1,000, 4,500, and 500 times compared to MLM and/or NSP in the Customer Support, Scientific, and Legal Domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that ***FastDoc*** either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal Domains. Moreover, reduced training helps mitigate the risk of catastrophic forgetting. Thus, unlike the baselines, ***FastDoc*** shows a negligible drop in open-domain performance.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: 1. Added the term "continual" to the title and used it more often in the abstract; 2. Added LinkBERT and CDLM baselines in the Customer Support (Sec 5.1) and Legal (Sec 5.3) Domains; 3. Added details on SciBERT in Section 6; 4. Added Section 7.3 on using a larger model backbone for FastDoc across all 3 domains
Assigned Action Editor: ~Alessandro_Sordoni1
Submission Number: 2054