Large Language Models Encode Geoscience Knowledge

ACL ARR 2024 June Submission 4832 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · License: CC BY 4.0
Abstract: Large language models (LLMs) have opened up interdisciplinary applications that use artificial intelligence to foster scientific discovery in specific domains (AI for Science, AI4S). In this study, we introduce a data-centric recipe for adapting LLMs to the geoscience domain, with potential for broader interdisciplinary use. We tailor an open-source LLM to geoscience by further pre-training it on a comprehensive geoscience text corpus and then fine-tuning it on a custom instruction-tuning dataset. These efforts culminate in LLMs of multiple sizes specialized for geoscience tasks. Through rigorous evaluation on geoscience examinations and open-domain questions, our models achieve state-of-the-art performance across a diverse array of natural language processing tasks within the geoscience domain.
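The abstract describes a two-stage adaptation recipe: continued pre-training on a domain corpus followed by instruction fine-tuning. The sketch below illustrates what the first stage could look like with Hugging Face Transformers; the base model name, corpus file, and hyperparameters are placeholders and are not taken from the paper, which does not specify its training code.

```python
# Hypothetical sketch of stage 1 (continued pre-training on geoscience text).
# Stage 2 would repeat the same loop on instruction-response pairs formatted as prompts.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base_model = "open-source-base-llm"  # placeholder; the paper's base model is not given here
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # some base tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Raw geoscience corpus as plain text, tokenized for a causal LM objective.
corpus = load_dataset("text", data_files={"train": "geoscience_corpus.txt"})["train"]
corpus = corpus.map(
    lambda x: tokenizer(x["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # next-token prediction

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ckpt_geoscience_pretrain",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```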
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, NLP datasets, automatic evaluation of datasets
Languages Studied: English
Submission Number: 4832