Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora

29 May 2024 (modified: 13 Nov 2024)Submitted to NeurIPS 2024 Track Datasets and BenchmarksEveryoneRevisionsBibTeXCC BY 4.0
Keywords: domain-specific knowledge, data collection, large language model
Abstract: Large language models(LLMs) have demonstrated remarkable potential in various tasks, however, there remains a significant lack of open-source models and data for specific domains. Previous work has primarily focused on manually specifying resources and collecting high-quality data for specific domains, which is extremely time-consuming and labor-intensive. To address this limitation, we introduce large models into the data collection pipeline to guide the generation of domain-specific information and retrieve relevant data from Common Crawl(CC), a large public corpus. We called this method as Query of CC. It not only collects data related to domain-specific knowledge but also mines the data with potential reasoning procedures from the public corpus. By applying this method, we have collected a knowledge domain-related dataset named Knowledge Pile, which covers four main domains, including the sciences, humanities, and other categories. Through the analysis of Knowledge Pile, Query of CC can effectively retrieve relevant data from the covered knowledge domains and significantly enhance the performance in tests of mathematical and knowledge-related reasoning abilities. We have open-sourced our data on HuggingFace to promote academic progress in knowledge reasoning capabilities.
Supplementary Material: pdf
Submission Number: 1596
Loading