***FastDoc***: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

Abhilash Nandy; Manav Nitin Kapadnis; Sohan Patnaik; Yash Parag Butala; Pawan Goyal; Niloy Ganguly

Published: 11 Jun 2024, Last Modified: 17 Sept 2024. Accepted by TMLR. License: CC BY 4.0

Abstract: In this paper, we propose FastDoc (Fast Continual Pre-training Technique using Document Level Metadata and Taxonomy), a novel, compute-efficient framework that uses document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus. The main innovation is that during domain-specific pre-training, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents); fine-tuning, however, is done with token-level embeddings as inputs to this encoder. We perform such domain-specific pre-training on three different domains, namely the customer support, scientific, and legal domains, and compare performance on 6 different downstream tasks and 9 different datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around 1,000, 4,500, and 500 times compared to MLM and/or NSP in the Customer Support, Scientific, and Legal Domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that FastDoc either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal Domains. Moreover, reduced training helps mitigate the risk of catastrophic forgetting. Thus, unlike the baselines, FastDoc shows a negligible drop in performance on the open domain.
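The difference between sentence-level and token-level inputs described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_embed`, the hidden size of 8, the period-based sentence splitter, and the whitespace tokenizer are all hypothetical stand-ins chosen only to show how the encoder's input sequence shrinks when each position is a sentence rather than a token.

```python
import numpy as np

HIDDEN = 8  # toy hidden size (assumption; real encoders use larger sizes)

def toy_embed(text: str, dim: int = HIDDEN) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: a deterministic
    pseudo-random vector derived from the text."""
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).standard_normal(dim)

def sentence_level_inputs(document: str) -> np.ndarray:
    """One input position per sentence (pre-training style input)."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return np.stack([toy_embed(s) for s in sentences])

def token_level_inputs(document: str) -> np.ndarray:
    """One input position per whitespace token (fine-tuning style input)."""
    tokens = document.replace(".", " ").split()
    return np.stack([toy_embed(t) for t in tokens])

doc = ("Press the power button. Wait for the light to blink. "
       "Open the app and pair the device.")
sent_in = sentence_level_inputs(doc)
tok_in = token_level_inputs(doc)

print(sent_in.shape)  # (3, 8): three sentences
print(tok_in.shape)   # (17, 8): seventeen tokens
```

In this toy document the encoder would see 3 positions instead of 17, which is the kind of sequence-length reduction that lets long documents fit into a single input.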

Submission Length: Long submission (more than 12 pages of main content)

Changes Since Last Submission: We addressed the suggestions for improvements by the Action Editor and the Reviewers:
1. Limitations: applicability of the proposed model to decoder-only models: Main Paper - Limitations
2. Limitations: applicability of the proposed model when downstream tasks are generation tasks: Main Paper - Limitations
3. Tone down (or remove) the claim that the method is faster than SciBERT, which starts from scratch, while the method starts from BERT: Main Paper - Section 6
4. Discuss the reason for not training one model that works across all domains: Main Paper - Footnote Number 3
5. Add more citations in related work to supervised pre-training (e.g. CLIP): Main Paper - Prior Art
6. Elaboration on the Catastrophic Forgetting setting: Main Paper - Section 7.1
7. Why not the NT-Xent loss function?: Main Paper - Section 2 (under "Contrastive Learning using document similarity labels.")
8. Why a margin of 1 in the contrastive loss?: Main Paper - Section 2, below Equation 1
9. Is the primary category enough for document similarity in the scientific domain?: Main Paper - Section 5.2
10. Mentioning the specific URL for the SciBERT checkpoint used for further fine-tuning in our paper: Main Paper - Footnote Number 10
11. The Customer Support Domain model improves open-domain results, although the improvement is minor: Main Paper - Section 7.1
12. Regarding the open-domain result improvement, contrastive learning in pre-training (which learns document similarity) improves performance on tasks that require predicting the relation between sentence pairs, like STS, QNLI, and MRPC: Main Paper - Section 7.1
13. Although FastDoc uses extra document information, it does not use the information derived from MLM during continual pre-training that many other conventional domain-specific baselines use. Therefore, we maintain that our approach is both equitable and innovative when compared to the baselines: Main Paper - after Section 5.3 - "Summary of Experiments and Results"
14. Choice of GPU-Hours as a metric for measuring pre-training compute: Appendix - Section F
15. The encoder trained in FastDoc takes sentence embeddings as inputs, while the EManuals-BERT encoder takes token embeddings. There are 37.3 tokens per sentence in the pre-training corpus, so FastDoc sees 37.3x fewer samples than EManuals-BERT for the same text, which raises the compute reduction from 33.3x to 1000x (33.3x37 is ~1000): Main Paper - Footnote Number 11
16. Using a model with a larger context window instead of FastDoc during continual pre-training: Main Paper - Footnote Number 9
17. Why does FastDoc answer well on numerical entities/locations? (Qualitative Analysis): Appendix - Table 21
18. Does FastDoc only work for encoder-only models?: Main Paper - Limitations
19. Table 6 suggests the main reason for the reduced compute of FastDoc is the architectural change: Main Paper - Section 6 - Last Sentence
20. Added the term "continual" in the title, and more repeatedly in the abstract
21. Added LinkBERT and CDLM baselines in the Customer Support (Section 5.1) and Legal (Section 5.3) Domains
22. Added Section 7.3 on using a larger model backbone for FastDoc across all 3 domains
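Items 7 and 8 above refer to a contrastive loss with a margin of 1 trained on document similarity labels. The paper's Equation 1 is not reproduced on this page, so the following is only a minimal sketch under an assumption: it uses the common pairwise margin-based form (squared distance for similar pairs, squared hinge on the margin for dissimilar pairs). The function name `contrastive_loss` and the Euclidean distance are illustrative choices, not taken from the paper.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, similar, margin=1.0):
    """Pairwise margin-based contrastive loss (assumed form, not
    necessarily the paper's Equation 1).

    emb_a, emb_b: arrays of shape (n_pairs, dim), document embeddings.
    similar: array of shape (n_pairs,), 1 for similar pairs, 0 otherwise.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=-1)           # pairwise distance
    pull = similar * d ** 2                              # similar pairs: shrink distance
    push = (1 - similar) * np.maximum(0.0, margin - d) ** 2  # dissimilar: push past margin
    return float(np.mean(pull + push))

# Three toy pairs: identical and similar, far apart and dissimilar,
# identical but dissimilar (the only pair that is penalized).
a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
labels = np.array([1, 0, 0])
loss = contrastive_loss(a, b, labels)
print(loss)  # approximately 0.333
```

With a margin of 1, dissimilar pairs contribute nothing once their distance exceeds 1, so only the third pair (distance 0, labelled dissimilar) adds a penalty of 1, giving a mean of 1/3.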

Video: https://drive.google.com/file/d/1ENnNnCpv_QCyUHmsAaDU0eteTOHIAwwg/view?usp=sharing

Code: https://github.com/manavkapadnis/FastDoc-Fast-Pre-training-Technique/

Supplementary Material: zip

Assigned Action Editor: ~Alessandro_Sordoni1

Submission Number: 2054
